Systems and methods for visualizing structural variation and phasing information

ABSTRACT

A system for providing structural variation or phasing information is provided. The system accesses a nucleic acid sequence dataset corresponding to a target nucleic acid in a sample. The dataset comprises a header, synopsis, and data section. The data section comprises a plurality of sequencing reads. Each sequencing read comprises a first portion corresponding to a subset of the target nucleic acid and a second portion that encodes an identifier for the sequencing read from a plurality of identifiers. One or more programs in the memory of the system use a microprocessor of the system to provide a haplotype visualization tool that receives a request for structural variation or phasing information from the dataset. The request is evaluated against the synopsis thereby identifying portions of the data section. Structural variation or phasing information is formatted for display in the haplotype visualization tool using the identified portions of the data section.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 14/995,090, which claims priority to U.S. Patent Application No. 62/120,873, entitled “Systems and Methods for Visualizing Structural Variation and Phasing Information,” filed Feb. 25, 2015, and also claims priority to U.S. Patent Application No. 62/102,926, entitled “Systems and Methods for Visualizing Structural Variation and Phasing Information,” filed Jan. 13, 2015, each of which is hereby incorporated by reference herein in its entirety.

TECHNICAL FIELD

This specification describes technologies relating to visualizing structural variation and phasing information in nucleic acid sequencing data.

BACKGROUND

Haplotype assembly from experimental data obtained from human genomes sequenced using massively parallelized sequencing methodologies has emerged as a prominent source of genetic data. Such data serves as a cost-effective way of implementing genetics based diagnostics as well as human disease study, detection, and personalized treatment.

The long-range information provided by such massively parallelized sequencing methodologies is disclosed, for example, in U.S. Patent Application No. 62/072,214, filed Oct. 29, 2014, entitled “Analysis of Nucleic Acid Sequences.” Such techniques greatly facilitate the detection of large-scale structural variations of the genome, such as translocations, large deletions, or gene fusions. Other examples include, but are not limited to, the sequencing-by-synthesis platform (ILLUMINA), Bentley et al., 2008, “Accurate whole human genome sequencing using reversible terminator chemistry,” Nature 456:53-59; sequencing-by-ligation platforms (POLONATOR; ABI SOLiD), Shendure et al., 2005, “Accurate Multiplex Polony Sequencing of an Evolved Bacterial Genome,” Science 309:1728-1732; pyrosequencing platforms (ROCHE 454), Margulies et al., 2005, “Genome sequencing in microfabricated high-density picoliter reactors,” Nature 437:376-380; and single-molecule sequencing platforms (HELICOS HELISCOPE), Pushkarev et al., 2009, “Single-molecule sequencing of an individual human genome,” Nature Biotech 17:847-850, and (PACIFIC BIOSCIENCES), Eid et al., “Real-time sequencing from single polymerase molecules,” Science 323:133-138, each of which is hereby incorporated by reference in its entirety.

With the availability of haplotype data spanning large portions of the human genome, the need has arisen for ways in which to efficiently work with this data in order to advance the above stated objectives of diagnosis, discovery, and treatment, particularly as the cost of whole genome sequencing for a personal genome drops below $1000. To computationally assemble haplotypes from such data, it is necessary to disentangle the reads from the two haplotypes present in the sample and infer a consensus sequence for both haplotypes. Such a problem has been shown to be NP-hard. See Lippert et al., 2002, “Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem,” Brief. Bioinform. 3:23-31, which is hereby incorporated by reference.

The assembly viewer Consed supports visualization of reads obtained from the above-identified sequencing methods. See Gordon 1998, “Consed: A graphical tool for sequence finishing,” Genome Research 8:198-202.

Another visualization tool is EagleView. See Huang and Marth, 2008, “EagleView: A genome assembly viewer for next-generation sequencing technologies,” Genome Research 18:1538-1543.

Still another such viewer is HapEdit. See Kim et al., “HapEdit: an accuracy assessment viewer for haplotype assembly using massively parallel DNA-sequencing technologies,” Nucleic Acids Research, 2011, 1-5. HapEdit provides tools for assessing the accuracy of haplotype assemblies and permits a user to fit the composition rates of reads sequenced by numerous different sequencing technologies.

While the above-disclosed programs are each significant advancements in their own right, they do not adequately address the need in the art for tools for visually assessing structural variants (e.g., deletions, duplications, copy-number variants, insertions, inversions, translocations, long terminal repeats (LTRs), short tandem repeats (STRs), and a variety of other useful characterizations) in sequencing data.

SUMMARY

Technical solutions (e.g., computing systems, methods, and non-transitory computer readable storage mediums) for visually assessing structural variants are provided. With platforms such as those disclosed in U.S. Patent Application No. 62/072,214, filed Oct. 29, 2014, entitled “Analysis of Nucleic Acid Sequences,” which is hereby incorporated by reference, the genome is fragmented, partitioned, and barcoded prior to the target identification. Therefore the integrity of the barcode information is maintained across the genome. The barcode information is used to identify potential structural variation breakpoints by detecting regions of the genome that show significant barcode overlap. The barcode information is also used to obtain phasing information.

The following presents a summary of the invention in order to provide a basic understanding of some of the aspects of the invention. This summary is not an extensive overview of the invention. It is not intended to identify key/critical elements of the invention or to delineate the scope of the invention. Its sole purpose is to present some of the concepts of the invention in a simplified form as a prelude to the more detailed description that is presented later.

One aspect of the present disclosure is a system for providing structural variation or phasing information over a network connection to a remote client computer. The system comprises one or more microprocessors, a persistent memory and a non-persistent memory. The persistent memory (e.g., a hard drive) and the non-persistent memory (e.g., RAM memory) collectively store one or more nucleic acid sequence datasets. Each respective nucleic acid sequencing dataset in the one or more nucleic acid sequence datasets corresponds to at least one target nucleic acid in a respective sample in a plurality of samples. The respective sample is associated with a reference genome of a species that may serve as a benchmark for analysis of the respective sample in some embodiments. For instance, in some embodiments the respective sample is mapped to the reference genome and the reference genome is thereby used as a template (reference) to parse queries to visualize portions of the respective sample. For instance, in some embodiments a sample is from a human subject. In such instance, a human genome (as opposed to a genome from a different species) serves as the reference genome and the respective sample is mapped to the human genome. In this way, requests to visualize sequences or sequence variations in certain human chromosomes, or portions thereof from the sample, can be interpreted and handled using the disclosed systems and methods, based on such mapping to the reference genome.

The respective nucleic acid sequencing dataset comprises (i) a header, (ii) a synopsis, and (iii) a data section. The data section comprises a plurality of aligned sequence reads from the sample and information about each variant call made. Advantageously, the data section is extensible and can store additional data. Each respective sequencing read in the plurality of sequencing reads comprises a first portion that corresponds to a subset of at least one target nucleic acid in the respective sample and a second portion that encodes a respective identifier for the respective sequencing read in a plurality of identifiers. Each respective identifier is independent of the sequence of the at least one target nucleic acid. Sequencing reads in the plurality of sequencing reads collectively include the plurality of identifiers.

The persistent memory and the non-persistent memory further collectively store one or more programs that use the one or more microprocessors to provide a haplotype visualization tool to a client for installation on the remote client computer. The system receives a request, sent from the client over a network connection (e.g., Internet), for structural variation or phasing information using a first dataset in the one or more datasets. Responsive to receiving the request, the request is automatically filtered by performing a method comprising loading the header and the synopsis of the first dataset into the non-persistent memory if not already loaded into the non-persistent memory while retaining the data section in persistent memory. In the method, the request is compared to (analyzed against) the synopsis of the first dataset thereby identifying one or more portions of the data section of the first dataset. These one or more identified portions of the data section are, in turn, loaded into non-persistent memory. Structural variation or phasing information is formatted for display on the client computer using the first dataset. Then the formatted structural variation or phasing information is transmitted over the network connection to the client device for display on the client device.
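The request filtering described above can be illustrated with a minimal sketch. The on-disk layout used here (a 4-byte length header, a JSON synopsis of genomic ranges with byte offsets, then raw data chunks) and all names such as `write_dataset` and `Dataset` are assumptions made for illustration only; the disclosure does not specify this layout.

```python
import io
import json
import struct


def write_dataset(buf, chunks):
    """Write a toy dataset: header, synopsis, then data section.

    chunks: list of (chrom, start, end, payload_bytes), half-open ranges.
    The layout is hypothetical, not the format of the disclosure.
    """
    synopsis, offset = [], 0
    for chrom, start, end, payload in chunks:
        synopsis.append({"chrom": chrom, "start": start, "end": end,
                         "offset": offset, "length": len(payload)})
        offset += len(payload)
    syn_bytes = json.dumps(synopsis).encode()
    buf.write(struct.pack("<I", len(syn_bytes)))  # header: synopsis size
    buf.write(syn_bytes)                          # synopsis
    for _, _, _, payload in chunks:               # data section
        buf.write(payload)


class Dataset:
    """Keeps only the header and synopsis in memory; data stays in the file."""

    def __init__(self, f):
        self.f = f
        self.synopsis = None
        self.data_start = None

    def _ensure_loaded(self):
        # Load header and synopsis once, if not already in memory.
        if self.synopsis is None:
            self.f.seek(0)
            (n,) = struct.unpack("<I", self.f.read(4))
            self.synopsis = json.loads(self.f.read(n))
            self.data_start = 4 + n

    def query(self, chrom, start, end):
        """Return only the data-section portions overlapping the request."""
        self._ensure_loaded()
        out = []
        for e in self.synopsis:
            if e["chrom"] == chrom and e["start"] < end and start < e["end"]:
                self.f.seek(self.data_start + e["offset"])
                out.append(self.f.read(e["length"]))
        return out


# Usage: an in-memory buffer stands in for persistent storage.
buf = io.BytesIO()
write_dataset(buf, [("chr1", 0, 100, b"AAA"),
                    ("chr1", 100, 200, b"BBBB"),
                    ("chr2", 0, 50, b"CC")])
ds = Dataset(buf)
hits = ds.query("chr1", 150, 160)
```

In this sketch a request for chr1:150-160 reads only the second chunk; the other chunks are never pulled into memory.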

In some embodiments, the header delineates a plurality of components in the respective nucleic acid sequencing dataset. In some embodiments the plurality of components comprises two or more components, three or more components, four or more components or five or more components selected from the group consisting of a summary, an index to variant call data, a phase block track, a refseq index track, a gene track, an exon track, an index to read data, a structural variant dataset track, an index to a target dataset, and an index to a fragment dataset.

In some embodiments, the plurality of components comprises the summary and this summary comprises two or more items, three or more items, four or more items, five or more items, or six or more items in the group consisting of: a percentage of known SNPs phased in the respective nucleic acid sequencing dataset, a longest phase block in the respective nucleic acid sequencing dataset, a number of unique barcodes used in the respective nucleic acid sequencing dataset, an average fragment length in the respective nucleic acid sequencing dataset, a mean of the average fragment length in the respective nucleic acid sequencing dataset, a percentage of fragments greater than a lower threshold in the respective nucleic acid sequencing dataset, a fragment length histogram in the respective nucleic acid sequencing dataset, an N50 phase block size in the respective nucleic acid sequencing dataset, a phase block histogram in the respective nucleic acid sequencing dataset, a number of sequence reads represented by the respective nucleic acid sequencing dataset, a median insert size in the respective nucleic acid sequencing dataset, a median depth in the respective nucleic acid sequencing dataset, a percent of the target genome with zero coverage in the respective nucleic acid sequencing dataset, a mapped reads percentage for the respective nucleic acid sequencing dataset, a PCR duplication percentage for the respective nucleic acid sequencing dataset, a coverage histogram for the respective nucleic acid sequencing dataset, an identity of a test nucleic acid that forms the basis for the respective nucleic acid sequencing dataset, a genome source for the respective nucleic acid sequencing dataset, a sex of an organism that originated the at least one test nucleic acid of the respective nucleic acid sequencing dataset, a sex of the organism that originated the respective sample of the respective nucleic acid sequencing dataset, a dataset file format version of the respective nucleic acid sequencing dataset, and a pointer to a plurality of structural variant calls made for the respective nucleic acid sequencing dataset. Advantageously, as this non-limiting example of the list of information indicates, the disclosed nucleic acid sequencing datasets can contain arbitrary bits of metadata (e.g., annotation data) that might be of user interest along with sequencing data.
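As an illustration of one summary item, the N50 phase block size can be computed as sketched below. The disclosure does not define N50; this sketch assumes the conventional definition (the largest length L such that blocks of length at least L together cover at least half of the total phased length), and the function name `n50` is hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of phase block lengths.

    Assumes the conventional definition: walk the blocks from longest to
    shortest and return the length at which the running total first
    reaches half of the summed length. Returns 0 for an empty input.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Usage: block lengths in base pairs (toy values).
example = n50([2, 2, 2, 3, 3, 4, 8, 8])
```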

In some embodiments, the plurality of components comprises the index to variant call data that provides a correspondence between respective ranges of the genome of the species and offsets in the data section where variant call data for the respective ranges is found.
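A range-to-offset index of this kind might be sketched as follows. The class name `VariantIndex`, the tuple layout, and the assumption that ranges are half-open and non-overlapping within a chromosome are all illustrative choices, not details taken from the disclosure.

```python
import bisect


class VariantIndex:
    """Maps genomic ranges to byte offsets in a data section (toy sketch)."""

    def __init__(self, entries):
        # entries: (chrom, range_start, range_end, offset); ranges are
        # assumed half-open and non-overlapping within a chromosome.
        self.by_chrom = {}
        for chrom, start, end, offset in sorted(entries):
            starts, rows = self.by_chrom.setdefault(chrom, ([], []))
            starts.append(start)
            rows.append((start, end, offset))

    def offsets(self, chrom, start, end):
        """Return offsets of every indexed range overlapping [start, end)."""
        starts, rows = self.by_chrom.get(chrom, ([], []))
        # Binary search for the last range starting at or before `start`.
        i = max(bisect.bisect_right(starts, start) - 1, 0)
        hits = []
        while i < len(rows) and rows[i][0] < end:
            if rows[i][1] > start:
                hits.append(rows[i][2])
            i += 1
        return hits


# Usage with toy ranges and offsets.
index = VariantIndex([("chr1", 0, 1000, 0),
                      ("chr1", 1000, 2000, 512),
                      ("chr1", 2000, 3000, 1300),
                      ("chr2", 0, 1000, 2048)])
found = index.offsets("chr1", 1500, 2500)
```

A viewer could then seek directly to the returned offsets instead of scanning the whole variant call section.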

In some embodiments, the plurality of components comprises the phase block track. The phase block track comprises (i) a dictionary and (ii) a track data section comprising phase information for one or more chromosomes in the genome of the species. In some embodiments, the dictionary comprises a plurality of names, and for each respective name in the plurality of names, an offset into the track data where records for the corresponding name are found. In some embodiments, the track data section comprises a plurality of records, and each record in the plurality of records represents a phase block in the target nucleic acid. In some embodiments, the track data section is in the JSON file format.

In some embodiments, each respective record in the plurality of records specifies (i) a chromosome number corresponding to the respective record, (ii) a position where the phase block starts on the chromosome, (iii) a position where the phase block ends, (iv) a unique name for the record, and (v) phasing information about the phase block.
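A phase block track with a name-to-offset dictionary and JSON track data might look like the following sketch. The field names (`chrom`, `start`, `end`, `name`, `phasing`), the newline-delimited layout, and the use of chromosome names as dictionary keys are assumptions for illustration; the disclosure only states that the dictionary maps names to offsets and that the track data may be JSON.

```python
import json


def build_track(records_by_name):
    """Serialize records and build a dictionary of name -> byte offset.

    records_by_name: dict mapping a name to a list of phase block records.
    Each group is stored as one JSON line in the track data section.
    """
    dictionary = {}
    data = bytearray()
    for name, records in records_by_name.items():
        dictionary[name] = len(data)  # offset where this name's records start
        data += (json.dumps(records) + "\n").encode()
    return dictionary, bytes(data)


def read_records(dictionary, data, name):
    """Jump to the offset for `name` and decode only that group of records."""
    offset = dictionary[name]
    end = data.index(b"\n", offset)
    return json.loads(data[offset:end])


# Usage with toy records carrying the five fields listed above.
records = {
    "chr1": [{"chrom": "chr1", "start": 100, "end": 5000,
              "name": "pb1", "phasing": {"phased_snps": 12}}],
    "chr2": [{"chrom": "chr2", "start": 40, "end": 900,
              "name": "pb2", "phasing": {"phased_snps": 3}}],
}
dictionary, track_data = build_track(records)
chr2_blocks = read_records(dictionary, track_data, "chr2")
```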

In some embodiments, each respective record in the plurality of records is represented by a node in a plurality of nodes in a respective interval tree in a plurality of interval trees, and each interval tree in the plurality of interval trees represents a chromosome in a plurality of chromosomes for the species. In some such embodiments, a node in the plurality of nodes of a first interval tree in the plurality of interval trees stores a midpoint of the node, the midpoint of the node is a position of the midpoint, on the corresponding chromosome, of the phase block corresponding to the node, each respective node in the plurality of nodes of the first interval tree has a link to a left child node, which corresponds to the phase block immediately to the left of (i.e., numerically less than) the phase block represented by the respective node in the genome of the species, each respective node in the plurality of nodes of the first interval tree has a link to a right child node, which corresponds to the phase block immediately to the right of (i.e., numerically greater than) the phase block represented by the respective node in the genome of the species, each respective node in the plurality of nodes of the first interval tree has a sorted set of nodes that represent phase blocks that overlap the midpoint of the respective node sorted by left hand position of such phase blocks, and each respective node in the plurality of nodes of the first interval tree has a sorted set of nodes that represent phase blocks that overlap the midpoint of the respective node sorted by right hand position of such phase blocks. In some such embodiments, each respective node in the plurality of nodes of the first interval tree further includes a name, which is an offset in the track data section to the record in the plurality of records that contains phase information for the phase block corresponding to the respective node.
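The node layout described above resembles a textbook centered interval tree, which can be sketched as follows. This is an illustrative variant rather than the structure of the disclosure: it picks each node's midpoint as the median endpoint of the blocks it holds, treats intervals as closed, and stores a string name where the disclosure stores an offset into the track data. The class name `Node` and method `at` are hypothetical.

```python
class Node:
    """One node of a centered interval tree over phase blocks (toy sketch)."""

    def __init__(self, blocks):
        # blocks: non-empty list of (start, end, name), closed intervals.
        points = sorted(p for s, e, _ in blocks for p in (s, e))
        self.mid = points[len(points) // 2]  # always an endpoint of a block
        left = [b for b in blocks if b[1] < self.mid]
        right = [b for b in blocks if b[0] > self.mid]
        here = [b for b in blocks if b[0] <= self.mid <= b[1]]
        # Blocks overlapping the midpoint, sorted two ways.
        self.by_start = sorted(here, key=lambda b: b[0])
        self.by_end = sorted(here, key=lambda b: b[1], reverse=True)
        self.left = Node(left) if left else None
        self.right = Node(right) if right else None

    def at(self, pos):
        """Return names of all phase blocks that contain position `pos`."""
        hits = []
        if pos <= self.mid:
            for b in self.by_start:      # ascending start positions
                if b[0] > pos:
                    break
                hits.append(b[2])
            if pos < self.mid and self.left:
                hits += self.left.at(pos)
        else:
            for b in self.by_end:        # descending end positions
                if b[1] < pos:
                    break
                hits.append(b[2])
            if self.right:
                hits += self.right.at(pos)
        return hits


# Usage: one tree per chromosome; toy blocks for a single chromosome.
tree = Node([(100, 500, "pb1"), (400, 900, "pb2"), (1000, 1500, "pb3")])
overlapping = sorted(tree.at(450))
```

Because each sorted list can be scanned until the first non-matching block, a query touches only the nodes along one root-to-leaf path plus the reported blocks.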

In some embodiments, the header further comprises the version of the dataset structure used by the nucleic acid sequencing dataset. In some embodiments, the plurality of components comprises the refseq index, and the refseq index comprises an index of a plurality of molecular variation identifiers that are called in the sample. In some such embodiments, each respective molecular variation identifier in the plurality of molecular variation identifiers is a dbSNP identifier.

In some embodiments, the plurality of components comprises the gene track. In such embodiments, the gene track comprises a plurality of genes and, for each respective gene in the plurality of genes, a number of single nucleotide polymorphisms in the respective gene.

Another aspect of the present disclosure provides a system for processing program output over a network connection using a local computer, where the local computer comprises one or more microprocessors, and a memory that stores one or more programs. The one or more programs use the one or more microprocessors to execute a method in accordance with a first operating system running on the local computer. In the method a first instance of a first program is invoked. Then, there is obtained through the first instance of the first program from a user, a login and a password to a user account on a remote computer. This is used to log the user into the user account on the remote computer automatically (using the login and the password provided by the first instance of the first program) across a network connection between the local computer and the remote computer. Responsive to successful login on the remote computer, there is automatically sent, without human intervention, a second instance of the first program configured to auto-install on the remote computer upon transmission to the remote computer when the remote computer does not already have the first program available in the user's account. Next, there is received from the remote computer a request to open a panel within the first instance of the first program. The panel is originated by the second instance of the first program running on the remote computer. The panel solicits input from the user for controlling the second instance of the first program. Responsive to receiving input from the user for controlling the second instance of the first program in the panel on the local computer, the input is sent to the second instance of the first program on the remote computer across the network connection (e.g., wireless or wired connection). Next, there is received, from the remote computer across the network connection, output from the second instance of the first program responsive to the input. This output is displayed at the local computer.

Another aspect of the present disclosure provides a system for viewing nucleic acid sequencing data. The system comprises one or more microprocessors and a memory. The memory stores one or more programs that use the one or more microprocessors to obtain a nucleic acid sequencing dataset corresponding to at least one target nucleic acid in a sample. The nucleic acid sequencing dataset comprises a plurality of sequencing reads from the sample. Each respective sequencing read in the plurality of sequencing reads comprises a first portion that corresponds to a subset of at least one target nucleic acid in the sample and a second portion that encodes a respective identifier (e.g., bar code) for the respective sequencing read in a plurality of identifiers. Each respective identifier is independent of the sequence of the at least one target nucleic acid. The plurality of sequencing reads collectively includes the plurality of identifiers. A visualization tool is displayed. A request is obtained from a user through the visualization tool. The request specifies a genomic region represented by the nucleic acid sequencing dataset. Responsive to obtaining the request, the request is parsed by obtaining a plurality of sequencing reads within the genomic region from the nucleic acid sequencing dataset. A scan window is run against the plurality of sequencing reads thereby creating a plurality of windows, each respective window of the plurality of windows corresponding to a different region of the genomic region and including an identity of each identifier of each sequencing read in the different region of the genomic region in the nucleic acid sequencing dataset. A two dimensional heat map that represents each possible window pair in the plurality of windows is displayed. Each respective window pair is displayed in the two dimensional heat map as a color selected from a color scheme based upon the number of identifiers in common in the respective window pair.
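The window scan and pairwise comparison described above can be sketched as follows. The read representation (a mapped position paired with a barcode string), the fixed non-overlapping window size, and the function names are assumptions for illustration; the resulting matrix of shared-identifier counts is what a color scheme would then be applied to.

```python
def window_barcodes(reads, region_start, region_end, window):
    """Collect the set of barcodes seen in each fixed-size window.

    reads: iterable of (position, barcode) pairs; the region is half-open.
    Reads outside the requested region are ignored.
    """
    n_windows = -(-(region_end - region_start) // window)  # ceiling division
    sets = [set() for _ in range(n_windows)]
    for pos, barcode in reads:
        if region_start <= pos < region_end:
            sets[(pos - region_start) // window].add(barcode)
    return sets


def overlap_matrix(sets):
    """Count barcodes shared by every pair of windows."""
    return [[len(a & b) for b in sets] for a in sets]


# Usage with toy reads: barcodes A, B, C across a 30-base region.
reads = [(5, "A"), (15, "A"), (18, "B"), (25, "B"), (28, "C"), (99, "Z")]
sets = window_barcodes(reads, 0, 30, 10)
matrix = overlap_matrix(sets)
```

The matrix is symmetric, with each diagonal entry equal to the number of distinct barcodes in that window. Unusually high off-diagonal counts between distant windows are the kind of signal the heat map is meant to expose.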

Various embodiments of systems, methods and devices within the scope of the appended claims each have several aspects, no single one of which is solely responsible for the desirable attributes described herein. Without limiting the scope of the appended claims, some prominent features are described herein. After considering this discussion, and particularly after reading the section entitled “Detailed Description,” one will understand how the features of various embodiments are used.

INCORPORATION BY REFERENCE

All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference in their entireties to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference.

BRIEF DESCRIPTION OF THE DRAWINGS

The implementations disclosed herein are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings. Like reference numerals refer to corresponding parts throughout the drawings.

FIG. 1 is an example block diagram illustrating a computing device in accordance with some implementations.

FIG. 2 illustrates exemplary constructs in accordance with an embodiment of the present disclosure.

FIG. 3 provides an overview of a nucleic acid sequencing dataset in accordance with an embodiment of the present disclosure.

FIG. 4 illustrates the data structure of an example phase block track within a nucleic acid sequencing dataset in accordance with some embodiments.

FIG. 5 illustrates an example phase block track in accordance with some embodiments.

FIG. 6 illustrates the data structure of an example gene track in accordance with some embodiments.

FIGS. 7A and 7B illustrate an example gene track in accordance with some embodiments.

FIG. 8 illustrates the data structure of an example structural variant dataset track within a nucleic acid sequencing dataset in accordance with some embodiments.

FIG. 9 illustrates an example structural variant dataset track in accordance with some embodiments.

FIG. 10 illustrates target, fragment and sequence read data within a nucleic acid sequencing dataset in accordance with some embodiments.

FIG. 11 illustrates variant call data within a nucleic acid sequencing dataset in accordance with some embodiments.

FIGS. 12A and 12B illustrate a summarization module in a haplotype visualization tool in accordance with some embodiments.

FIGS. 13A and 13B illustrate a summarization module in a haplotype visualization tool in accordance with additional embodiments.

FIG. 14A illustrates a screen shot of a phase visualization module in a haplotype visualization tool in accordance with some embodiments.

FIG. 14B illustrates another screen shot of a phase visualization module in a haplotype visualization tool in accordance with some embodiments.

FIG. 15 illustrates another screen shot of a phase visualization module in a haplotype visualization tool in accordance with some embodiments.

FIG. 16 illustrates another screen shot of a phase visualization module in a haplotype visualization tool in accordance with some embodiments.

FIG. 17 illustrates search function features of a haplotype visualization tool in accordance with some embodiments.

FIG. 18 illustrates a screen shot of a structural variants module in a haplotype visualization tool in accordance with some embodiments.

FIG. 19 illustrates another screen shot of a structural variants module in a haplotype visualization tool in accordance with some embodiments.

FIG. 20 illustrates still another screen shot of a structural variants module in a haplotype visualization tool in accordance with some embodiments.

FIG. 21 illustrates still an additional screen shot of a structural variants module in a haplotype visualization tool in accordance with some embodiments.

FIG. 22 illustrates a screen shot of a read visualization module in a haplotype visualization tool in accordance with some embodiments.

FIG. 23 illustrates another screen shot of a structural variants module in a haplotype visualization tool in accordance with some embodiments.

FIG. 24 illustrates another screen shot of a structural variants module in a haplotype visualization tool in accordance with some embodiments.

FIG. 25 illustrates another screen shot of a structural variants module in a haplotype visualization tool in accordance with some embodiments.

FIG. 26 illustrates a phase visualization module in a haplotype visualization tool in accordance with some embodiments.

FIG. 27 illustrates another aspect of a phase visualization module in a haplotype visualization tool in accordance with some embodiments.

FIG. 28A illustrates another aspect of a phase visualization module in a haplotype visualization tool in accordance with some embodiments.

FIG. 28B illustrates still another aspect of a phase visualization module in a haplotype visualization tool in accordance with some embodiments.

FIG. 29 illustrates another aspect of a phase visualization module in a haplotype visualization tool in accordance with some embodiments.

FIG. 30 illustrates another aspect of a phase visualization module in a haplotype visualization tool in accordance with some embodiments.

FIG. 31 is an example block diagram illustrating a computing system in accordance with some implementations.

FIG. 32 is an example of a credential challenge for remote initiation of an instance of a haplotype visualization tool in accordance with the disclosed embodiments.

FIG. 33 illustrates a structural variants module in a haplotype visualization tool in accordance with some embodiments in which a sequence read filter is turned off.

FIG. 34 illustrates a structural variants module in a haplotype visualization tool in accordance with some embodiments in which a sequence read filter is turned on.

DETAILED DESCRIPTION

Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. However, it will be apparent to one of ordinary skill in the art that the present disclosure may be practiced without these specific details. In other instances, well-known methods, procedures, components, circuits, and networks have not been described in detail so as not to unnecessarily obscure aspects of the embodiments.

It will also be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first subject could be termed a second subject, and, similarly, a second subject could be termed a first subject, without departing from the scope of the present disclosure. The first subject and the second subject are both subjects, but they are not the same subject.

The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the description of the invention and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

As used herein, the term “if” may be construed to mean “when” or “upon” or “in response to determining” or “in response to detecting,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” may be construed to mean “upon determining” or “in response to determining” or “upon detecting (the stated condition or event)” or “in response to detecting (the stated condition or event),” depending on the context.

The implementations described herein provide various technical solutions to detect a structural variant (e.g., deletions, duplications, copy-number variants, insertions, inversions, translocations, long terminal repeats (LTRs), short tandem repeats (STRs), and a variety of other useful characterizations) in sequencing data of a test nucleic acid obtained from a biological sample. Details of implementations are now described in relation to the Figures.

FIG. 1 is a block diagram illustrating a structural variant and phasing visualization system 100 in accordance with some implementations. The device 100 in some implementations includes one or more processing units CPU(s) 102 (also referred to as processors), one or more network interfaces 104, a user interface 106, a memory 112, and one or more communication buses 114 for interconnecting these components. The communication buses 114 optionally include circuitry (sometimes called a chipset) that interconnects and controls communications between system components. The memory 112 typically includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, ROM, EEPROM, flash memory, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, other random access solid state memory devices, or any other medium which can be used to store desired information; and optionally includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. The memory 112 optionally includes one or more storage devices remotely located from the CPU(s) 102. The memory 112, or alternatively the non-volatile memory device(s) within the memory 112, comprises a non-transitory computer readable storage medium. In some implementations, the memory 112 or alternatively the non-transitory computer readable storage medium stores the following programs, modules and data structures, or a subset thereof:

-   an optional operating system 116, which includes procedures for handling various basic system services and for performing hardware dependent tasks;
-   an optional network communication module (or instructions) 118 for connecting the device 100 with other devices, or a communication network;
-   an optional sequencing read processing module 120 for processing sequencing reads, including a structural variation determination sub-module 122 for identifying structural variations in a genetic sample from a single organism of a species and a phasing sub-module 124 for identifying the haplotype of each sequencing read of the genetic sample;
-   one or more nucleic acid sequencing datasets 126, each such dataset obtained using a genetic sample from a single organism of a species;
-   gene annotation data, optionally in the form of a gene track interval tree 128;
-   exon annotation data, optionally in the form of an exon track interval tree 142;
-   one or more additional sources of annotation data, optionally in the form of interval trees 146;
-   a haplotype visualization tool 148 for visualizing structural variation and phasing information in nucleic acid sequencing data, including any combination of one or more of a summarization module 150, a phase visualization module 152, a structural variants (visualization) module 154, and a read visualization module 156.

In some implementations, the user interface 106 includes an input device (e.g., a keyboard, a mouse, a touchpad, a track pad, and/or a touch screen) 100 for a user to interact with the system 100 and a display 108.

In some implementations, one or more of the above identified elements are stored in one or more of the previously mentioned memory devices, and correspond to a set of instructions for performing a function described above. The above identified modules or programs (e.g., sets of instructions) need not be implemented as separate software programs, procedures or modules, and thus various subsets of these modules may be combined or otherwise re-arranged in various implementations. In some implementations, the memory 112 optionally stores a subset of the modules and data structures identified above. Furthermore, in some embodiments, the memory stores additional modules and data structures not described above. In some embodiments, one or more of the above identified elements is stored in a computer system, other than that of system 100, that is addressable by system 100 so that system 100 may retrieve all or a portion of such data when needed.

Although FIG. 1 shows a “structural variation and phasing visualization system 100,” the figure is intended more as a functional description of the various features which may be present in computer systems than as a structural schematic of the implementations described herein. In practice, and as recognized by those of ordinary skill in the art, items shown separately could be combined and some items could be separated.

Advantageously, because the nucleic acid sequence datasets 126 are large in typical embodiments (e.g., 1 gigabyte or greater, 5 gigabytes or greater, or 10 gigabytes or greater), in some embodiments the structural variation and phasing visualization system 100 is part of a system that includes one or more client devices 3102 that are in electronic communication with the structural variation and phasing visualization system 100 of FIG. 1 across a communication network 3106. Such a network topology allows scientists and other users to use one of several network based technologies to run the haplotype visualization tool 148 on system 100, which in typical embodiments is a powerful server computer, but view the results on client device 3102, which can be, for example, a laptop computer. Any form of network technology for implementing this network topology is encompassed within the present disclosure. For instance, X-windows session forwarding (not shown in FIG. 31) is used in some embodiments. In other embodiments, the Internet (web) is used. In particular, a browser application is run on the client device 3102.

The process of running a program on a remote computer (e.g., in system 3100, the structural variation and phasing visualization system 100 is considered remote) and viewing the results on a client device 3102 (e.g., desktop or laptop) is cumbersome. A user must generally (i) install certain parts of the program on their computer 3102 and other parts on the server 100, (ii) use SSH or firewall software to create an open network port connecting the two computers (client device 3102 to system 100), and (iii) independently start different parts of the program on different systems. For example, the URL blog.trackets.com/2014/05/17/ssh-tunnel-local-and-remote-port-forwardingexplained-with-examples.html, which is hereby incorporated by reference, explains one way of setting up forwarding. As another example, the URL itg.chem.indiana.edu/inc/wiki/software/openssh/200.html explains another way of setting up forwarding. The present disclosure incorporates such techniques. However, advantageously, in some embodiments, the present disclosure affords solutions that seek to automate and improve upon the above-described networking techniques. Once a user has installed the haplotype visualization tool 148 on their client device 3102, they only need to provide the tool 148 with their credentials (e.g., user-name and password) for the remote computer (structural variation and phasing visualization system 100) that has the data and computational facilities to run the haplotype visualization tool 148. For instance, in some embodiments, referring to FIG. 32, the user running the haplotype visualization tool 148 on client 3102 will be provided with the challenge 3200 that includes a query for the server name or address 3204, the user's name 3206, an optional SSH key file (to enable encrypted connection) 3208, an optional SSH key password 3210, and a work location 3212 on the server.
The instance of the haplotype visualization tool 148 on their client device 3102 then connects to the remote computer 100 and authenticates as the user using the provided credentials. Using that connection, it installs the haplotype visualization tool 148 on the remote computer, starts it, and configures any necessary network port forwarding. Once the haplotype visualization tool has done this, it opens up a new window on the client device 3102 that is “connected” to the haplotype visualization tool running on the remote structural variation and phasing visualization system. Of particular note, in such embodiments, the haplotype visualization tool 148 on the client device 3102 includes a copy of itself that is intended to run on the structural variation and phasing visualization system 100. In some embodiments, the structural variation and phasing visualization system 100 is running a first operating system and the client device 3102 is running a second operating system. In some embodiments, the first operating system and the second operating system are the same. In some embodiments, the first operating system and the second operating system are different. In some embodiments, the first operating system is one of iOS, DARWIN, RTXC, LINUX, UNIX, OS X, or WINDOWS, and the second operating system is other than the first operating system and one of iOS, DARWIN, RTXC, LINUX, UNIX, OS X, or WINDOWS. In the disclosed embodiment, the haplotype visualization tool 148 running on the client device 3102 copies the archived copy of the haplotype visualization tool 148 to the structural variation and phasing system 100 and installs it (if it has not been installed before) during the setup process.
It will be appreciated that the system and method disclosed for remote initiation of the haplotype visualization tool 148 on a remote computer are applicable to a broad range of applications that require the computational resources of a remote server with the concomitant visual interface operating on a local computer in order to control such applications and to visualize data and computational results in real time or near real time.
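
As an illustration of the port forwarding step that the tool automates, the following minimal sketch assembles an OpenSSH command line that forwards a local port to a port on the remote computer. The function name `build_tunnel_command`, the example host name, and the port numbers are hypothetical and are not taken from the present disclosure; only the standard OpenSSH options `-N`, `-i`, and `-L` are relied upon. The sketch builds the argument list and does not open a connection.

```python
from typing import List, Optional


def build_tunnel_command(server: str, user: str, local_port: int,
                         remote_port: int,
                         key_file: Optional[str] = None) -> List[str]:
    """Return an OpenSSH argument list that forwards local_port on the
    client device to remote_port on the remote computer.

    All parameter names are illustrative assumptions.
    """
    for port in (local_port, remote_port):
        if not 0 < port < 65536:
            raise ValueError("ports must be in the range 1..65535")
    cmd = ["ssh", "-N"]  # -N: forward ports only, run no remote command
    if key_file:
        cmd += ["-i", key_file]  # optional SSH key file
    # -L local_port:host:remote_port, where host is resolved on the server
    cmd += ["-L", f"{local_port}:localhost:{remote_port}", f"{user}@{server}"]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_tunnel_command("server.example.org", "alice", 8080, 3000)))
```

The resulting list could be passed to a process launcher such as `subprocess.Popen`; how the disclosed tool actually establishes its connection is not specified here.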

Referring once again to FIGS. 1, 31, and 32, one aspect of the present disclosure provides a system 3100 for processing program output over a network connection 3106 (e.g., wired or wireless) using a local computer 3102. The local computer 3102 comprises one or more microprocessors (not shown), and a memory (not shown) that stores one or more programs (e.g., haplotype visualization tool 148). The one or more programs use the one or more microprocessors to execute a method in accordance with a first operating system running on the local computer. In the method, a first instance of a first program is invoked (e.g., a first instance of the haplotype visualization tool 148 is invoked on a client device 3102). Through the invoked first instance of the first program there is obtained, from a user, a login and a password to a user account on a remote computer (e.g., structural variation and phasing visualization system 100). The user is then logged into the user account on the remote computer automatically, using the login and the password provided by the first instance of the first program, across a network connection between the local computer and the remote computer (e.g., communication network 3106). Responsive to successful login on the remote computer 100, the method continues by automatically sending, without human intervention, a second instance of the first program 148 configured to auto-install on the remote computer 100 upon transmission to the remote computer. In some embodiments, the remote computer already has the second instance of the first program 148 installed and in some such embodiments the second instance of the first program is therefore not transmitted to the remote computer for installation. Once the second instance of the first program is installed on the remote computer 100, there is received from the remote computer a request to open a panel (not shown). This panel is originated by the second instance of the first program running on the remote computer 100.
The panel solicits input from the user for controlling the second instance of the first program. For instance, in some embodiments this panel is of the form illustrated in any one of FIGS. 12-21. In some embodiments, the panel is simpler, for instance containing a prompt for a dataset name or a search query for searching in a specified dataset. Responsive to receiving input from the user for controlling the second instance of the first program in the panel on the local computer, the input is sent to the second instance of the first program running on the remote computer 100 across the network connection. The remote computer receives this input across the network connection and, subsequently, output from the second instance of the first program responsive to the input is displayed on the local computer (e.g., within the first instance of the first program or in a separate web browser).

Referring to FIG. 2, in accordance with the disclosed systems and methods, a plurality of sequencing reads (not shown in its entirety in FIG. 2) is obtained using a test (target) nucleic acid 206 of a biological sample from a subject. In typical embodiments, the test (target) nucleic acid 206 is a fragment of the genome of the biological sample. In some embodiments, there is a single test (target) nucleic acid 206 (fragment) in a partition. In some embodiments, there are two or more test nucleic acids 206 (fragments) in a partition, each corresponding to different portions of the genome of the species of the biological sample. In some embodiments, there are five or more nucleic acids 206 (fragments) in a partition, each corresponding to different portions of the genome of the species of the biological sample. In some embodiments, there are ten or more nucleic acids 206 in a partition, each corresponding to different portions of the genome of the species of the biological sample. In some embodiments, the biological sample is a mixture and includes nucleic acid data representing the genome of two or more individuals in a species. In some embodiments, the biological sample is a mixture and includes nucleic acid data representing the genome of two or more species. For instance, in some embodiments the biological sample is infected with a retrovirus. In another example, the biological sample contains metagenomes because the sample was taken from sand or dirt or some other location and the goal is to find all the different genomes that exist in the sample.

The sequencing reads ultimately form the basis of a nucleic acid sequencing dataset 126. Each respective sequencing read 202 in the plurality of sequencing reads comprises a first portion that corresponds to a subset of a test nucleic acid and a second portion that encodes identification information for the respective sequencing read. The identification information is independent of the sequencing data of the test nucleic acid.
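
The two-portion structure of a sequencing read can be sketched as a small data structure. This is a minimal illustration only: the class name `SequencingRead`, the function `split_read`, the assumption that the identifier occupies the leading bases of the raw read, and the 16-base identifier length are all hypothetical choices and are not specified by the present disclosure.

```python
from dataclasses import dataclass

# Assumed identifier length, for illustration only.
BARCODE_LENGTH = 16


@dataclass(frozen=True)
class SequencingRead:
    barcode: str  # second portion: identification information
    insert: str   # first portion: subset of the test nucleic acid


def split_read(raw: str, barcode_length: int = BARCODE_LENGTH) -> SequencingRead:
    """Split a raw read into its identifier and its nucleic acid portion.

    Assumes (hypothetically) that the identifier leads the raw read.
    """
    if len(raw) <= barcode_length:
        raise ValueError("read is too short to contain both portions")
    return SequencingRead(barcode=raw[:barcode_length],
                          insert=raw[barcode_length:])


if __name__ == "__main__":
    read = split_read("ACGTACGTACGTACGT" + "TTGACC")
    print(read.barcode, read.insert)
```

Because the identifier is stored separately from the insert, it remains independent of the sequencing data of the test nucleic acid, as the paragraph above requires.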

In some embodiments, sequencing read lengths have an N50 (where the sum of the sequence read lengths that are greater than the stated N50 number is 50% of the sum of all sequencing read lengths). In typical embodiments, sequencing reads are tens or hundreds of bases in length, which in turn, are aligned to form constructs of at least about 10 kb, at least about 20 kb, or at least about 50 kb. In more preferred aspects, sequencing reads are tens or hundreds of bases in length, which in turn, are aligned to form constructs having at least about 100 kb, at least about 150 kb, at least about 200 kb, and in many cases, at least about 250 kb, at least about 300 kb, at least about 350 kb, at least about 400 kb, and in some cases, at least about 500 kb or more.
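
The N50 statistic referred to above can be computed as in the following minimal sketch, which uses the common convention that N50 is the largest length L such that reads of length at least L account for at least half of the total bases. The function name `n50` is illustrative.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of read (or construct) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half of the
        # total is the N50.
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

For the lengths 2, 2, 2, 3, 3, 4, 8, 8 the total is 32, and the two reads of length 8 already sum to 16, so the N50 is 8.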

In some embodiments, to obtain the plurality of sequencing reads from a biological sample from a subject, a test nucleic acid 206 is fragmented and these fragments are compartmentalized, or partitioned into discrete compartments or partitions (referred to interchangeably herein as partitions). In some embodiments, the test nucleic acid is the genome of a multi-chromosomal organism such as a human. In typical embodiments, multiple sequencing reads are measured from each such compartment or partition with lengths that are tens or hundreds of bases in length. Sequencing reads from the same compartment or partition that have the same bar code can be aligned to form sequence constructs that are at least about 25 kb, at least about 50 kb, at least about 100 kb, at least about 150 kb, at least about 200 kb, and in many cases, at least about 250 kb, at least about 300 kb, at least about 350 kb, at least about 400 kb, and in some cases, at least about 500 kb or more in length.

Each partition maintains separation of its own contents from the contents of other partitions. As used herein, the partitions refer to containers or vessels that may include a variety of different forms, e.g., wells, tubes, micro or nanowells, through holes, or the like. In preferred aspects, however, the partitions are flowable within fluid streams. In some embodiments, these vessels are comprised of, e.g., microcapsules or micro-vesicles that have an outer barrier surrounding an inner fluid center or core, or have a porous matrix that is capable of entraining and/or retaining materials within its matrix. In a preferred aspect, however, these partitions comprise droplets of aqueous fluid within a non-aqueous continuous phase, e.g., an oil phase. A variety of different vessels are described in, for example, U.S. patent application Ser. No. 13/966,150, filed Aug. 13, 2013, which is hereby incorporated by reference herein in its entirety. Likewise, emulsion systems for creating stable droplets in non-aqueous or oil continuous phases are described in detail in, e.g., Published U.S. Patent Application No. 2010-0105112, which is hereby incorporated by reference herein in its entirety. In certain embodiments, microfluidic channel networks are particularly suited for generating partitions as described herein. Examples of such microfluidic devices include those described in detail in Provisional U.S. Patent Application No. 61/977,804, filed Apr. 4, 2014, as well as PCT/US15/025197, the full disclosures of which are incorporated herein by reference in their entirety for all purposes. Alternative mechanisms may also be employed in the partitioning of individual cells, including porous membranes through which aqueous mixtures of cells are extruded into non-aqueous fluids. Such systems are generally available from, e.g., NANOMI, Inc.

In the case of droplets in an emulsion, partitioning of the test nucleic acid fragments into discrete partitions may generally be accomplished by flowing an aqueous, sample-containing stream into a junction into which is also flowing a non-aqueous stream of partitioning fluid, e.g., a fluorinated oil, such that aqueous droplets are created within the flowing stream of partitioning fluid, where such droplets include the sample materials. As described below, the partitions, e.g., droplets, also typically include co-partitioned barcode oligonucleotides.

The relative amount of sample materials within any particular partition may be adjusted by controlling a variety of different parameters of the system, including, for example, the concentration of test nucleic acid fragments in the aqueous stream, the flow rate of the aqueous stream and/or the non-aqueous stream, and the like. The partitions described herein are often characterized by having overall volumes that are less than 1000 pL, less than 900 pL, less than 800 pL, less than 700 pL, less than 600 pL, less than 500 pL, less than 400 pL, less than 300 pL, less than 200 pL, less than 100 pL, less than 50 pL, less than 20 pL, less than 10 pL, or even less than 1 pL. Where co-partitioned with beads, it will be appreciated that the sample fluid volume within the partitions may be less than 90% of the above described volumes, less than 80%, less than 70%, less than 60%, less than 50%, less than 40%, less than 30%, less than 20%, or even less than 10% of the above described volumes. In some cases, the use of low reaction volume partitions is particularly advantageous in performing reactions with very small amounts of starting reagents, e.g., input test nucleic acid fragments. Methods and systems for analyzing samples with low input nucleic acids are presented in U.S. Provisional Patent Application No. 62/017,580, filed Jun. 26, 2014, the full disclosure of which is hereby incorporated by reference in its entirety.

Once the test nucleic acid fragments are introduced into their respective partitions, the test nucleic acid fragments within partitions are generally provided with unique identifiers such that, upon characterization of those test nucleic acid fragments, they may be attributed as having been derived from their respective partitions. Such unique identifiers may be previously, subsequently or concurrently delivered to the partitions that hold the compartmentalized or partitioned test nucleic acid fragments, in order to allow for the later attribution of the characteristics, e.g., nucleic acid sequence information, to the sample nucleic acids included within a particular compartment, and particularly to relatively long stretches of contiguous sample nucleic acids that may be originally deposited into the partitions.

Accordingly, the test nucleic acid fragments are typically co-partitioned with the unique identifiers (e.g., barcode sequences). In particularly preferred aspects, the unique identifiers are provided in the form of oligonucleotides that comprise nucleic acid barcode sequences that are attached to test nucleic acid fragments in the partitions. The oligonucleotides are partitioned such that as between oligonucleotides in a given partition, the nucleic acid barcode sequences contained therein are the same, but as between different partitions, the oligonucleotides can have, and preferably do have, differing barcode sequences. In some embodiments, only one nucleic acid barcode sequence is associated with a given partition, although in some embodiments, two or more different barcode sequences are present in a given partition.

The nucleic acid barcode sequences will typically include from 6 to about 20 or more nucleotides within the sequence of the oligonucleotides. These nucleotides may be completely contiguous, i.e., in a single stretch of adjacent nucleotides, or they may be separated into two or more separate subsequences that are separated by one or more nucleotides. Separated subsequences are typically from about 4 to about 16 nucleotides in length.
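
To relate barcode length to the number of distinct barcode sequences available, the following minimal sketch computes an upper bound under the assumption that each position may hold any of the four bases A, C, G, or T. The function names are illustrative. A practical barcode set may be smaller than this bound, for example if sequences are excluded to tolerate sequencing errors; that consideration is an assumption of this note and is not drawn from the present disclosure.

```python
def barcode_space(length: int) -> int:
    """Upper bound on the number of distinct barcodes of a given length,
    assuming four possible bases at every position."""
    if length < 1:
        raise ValueError("length must be positive")
    return 4 ** length


def min_barcode_length(library_size: int) -> int:
    """Smallest barcode length whose sequence space can hold library_size
    distinct barcodes."""
    if library_size < 1:
        raise ValueError("library_size must be positive")
    length = 1
    while barcode_space(length) < library_size:
        length += 1
    return length


if __name__ == "__main__":
    for n in (6, 16, 20):
        print(n, barcode_space(n))
    print(min_barcode_length(1_000_000))
```

Under this bound a 6-nucleotide barcode allows at most 4,096 distinct sequences, and a library of 1,000,000 distinct barcodes requires barcodes of at least 10 nucleotides.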

The test nucleic acid is typically partitioned such that the nucleic acids are present in the partitions in relatively long fragments or stretches of contiguous nucleic acid molecules. These fragments typically represent a number of overlapping fragments of the overall test nucleic acid to be analyzed, e.g., an entire chromosome, exome, or other large genomic fragment. This test nucleic acid may include whole genomes, individual chromosomes, exomes, amplicons, or any of a variety of different nucleic acids of interest. Typically, the fragments of the test nucleic acid that are partitioned are longer than 1 kb, longer than 5 kb, longer than 10 kb, longer than 15 kb, longer than 20 kb, longer than 30 kb, longer than 40 kb, longer than 50 kb, longer than 60 kb, longer than 70 kb, longer than 80 kb, longer than 90 kb or even longer than 100 kb.

The test nucleic acid is also typically partitioned at a level whereby a given partition has a very low probability of including two overlapping fragments of the starting test nucleic acid. This is typically accomplished by providing the test nucleic acid at a low input amount and/or concentration during the partitioning process. As a result, in preferred cases, a given partition includes a number of long, but non-overlapping fragments of the starting test nucleic acid. The nucleic acid fragments in the different partitions are then associated with unique identifiers, where for any given partition, nucleic acids contained therein possess the same unique identifier, but where different partitions include different unique identifiers. Moreover, because the partitioning step allocates the sample components into very small volume partitions or droplets, it will be appreciated that in order to achieve the desired allocation as set forth above, one need not conduct substantial dilution of the sample, as would be required in higher volume processes, e.g., in tubes, or wells of a multiwell plate. Further, because the systems described herein employ such high levels of barcode diversity, one can allocate diverse barcodes among higher numbers of genomic equivalents, as provided above. In some embodiments, in excess of 10,000, 100,000, 500,000, etc. diverse barcode types are used to achieve genome:(barcode type) ratios that are on the order of 1:50 or less, 1:100 or less, 1:1000 or less, or even smaller ratios, while also allowing for loading higher numbers of genomes (e.g., on the order of greater than 100 genomes per assay, greater than 500 genomes per assay, 1000 genomes per assay, or even more) while still providing for far improved barcode diversity per genome. Here, each such genome is an example of a test nucleic acid.

Referring to FIG. 2, panels A and B, often the above-described partitioning is performed by combining the sample containing the test nucleic acid with a set of oligonucleotide tags (containing the barcodes) that are releasably attached to beads 208 prior to the partitioning step. The oligonucleotides may comprise at least a primer region 216 and a barcode region 214. Between oligonucleotides within a given partition, the barcode region 214 is substantially the same barcode sequence, but as between different partitions, the barcode region in most cases is a different barcode sequence. In some embodiments, the primer region 216 is an N-mer (either a random N-mer or an N-mer designed to target a particular sequence) that is used to prime the nucleic acids within the sample within the partitions. In some cases, where the N-mer is designed to target a particular sequence, the primer region 216 is designed to target a particular chromosome (e.g., human chromosome 1, 13, 18, or 21), or region of a chromosome, e.g., an exome or other targeted region. In some cases, the N-mer is designed to target a particular gene or genetic region, such as a gene or region associated with a disease or disorder (e.g., cancer). In some cases, the N-mer is designed to target a particular structural variation. Within the partitions, an amplification reaction is conducted using the primer sequence 216 (e.g., N-mer) to prime the nucleic acid sample at different places along the length of the nucleic acid. As a result of the amplification, each partition contains amplified products of the nucleic acid 202 that are attached to an identical or near-identical barcode, and that represent overlapping, smaller fragments of the nucleic acids in each partition. The barcode 214 therefore serves as a marker that signifies that a set of nucleic acids originated from the same partition, and thus potentially also originated from the same strand of test nucleic acid.
Following amplification, the nucleic acids are pooled, sequenced, and aligned using a sequencing algorithm. Because shorter sequence reads may, by virtue of their associated barcode sequences, be aligned and attributed to a single, long fragment of the test nucleic acid, all of the identified variants on that sequence can be attributed to a single originating fragment and single originating chromosome of the test nucleic acid. Further, by aligning multiple co-located variants across multiple long fragments, one can further characterize that chromosomal contribution. Accordingly, conclusions regarding the phasing of particular genetic variants may then be drawn. Such information may be useful for identifying haplotypes, which are generally a specified set of genetic variants that reside on the same nucleic acid strand or on different nucleic acid strands. Moreover, additionally or alternatively, structural variants are identified.
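
The attribution of short reads to a long originating fragment by way of their shared barcode can be sketched as a grouping step. The following minimal example assumes each aligned read has already been reduced to a hypothetical `(barcode, chromosome, position)` tuple; it reports, for each barcode and chromosome, the span of positions covered. It is an illustration of the grouping idea only and is not the phasing or structural variant method of the present disclosure.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (barcode, chromosome, aligned position).
AlignedRead = Tuple[str, str, int]


def barcode_spans(reads: Iterable[AlignedRead]) -> Dict[Tuple[str, str], Tuple[int, int]]:
    """For every (barcode, chromosome) pair, return the lowest and highest
    aligned positions among reads carrying that barcode."""
    spans: Dict[Tuple[str, str], Tuple[int, int]] = {}
    for barcode, chrom, pos in reads:
        key = (barcode, chrom)
        low, high = spans.get(key, (pos, pos))
        spans[key] = (min(low, pos), max(high, pos))
    return spans


if __name__ == "__main__":
    example = [
        ("AAAC", "chr1", 1000),
        ("AAAC", "chr1", 45000),
        ("GGTT", "chr1", 2000),
    ]
    for key, span in barcode_spans(example).items():
        print(key, span)
```

A single span per barcode is a simplification: as noted above, a partition may hold several non-overlapping fragments, so reads sharing a barcode on one chromosome may derive from more than one fragment.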

In some embodiments, the co-partitioned oligonucleotides also comprise functional sequences in addition to the barcode region 214 and the primer region 216 that primes the nucleic acids within the sample within the partitions. See, for example, the disclosure on co-partitioning of oligonucleotides and associated barcodes and other functional sequences, along with sample materials as described in, for example, U.S. Patent Application Nos. 61/940,318, filed Feb. 7, 2014, 61/991,018, filed May 9, 2014, and U.S. patent application Ser. No. 14/316,383, (Attorney Docket No. 43487-708.201) filed on Jun. 26, 2014, as well as U.S. patent application Ser. No. 14/175,935, filed Feb. 7, 2014, the full disclosures of which are hereby incorporated by reference in their entireties.

In one exemplary process, beads are provided, where each such bead includes large numbers of the above described oligonucleotides releasably attached to the beads. In such embodiments, all of the oligonucleotides attached to a particular bead include the same nucleic acid barcode sequence, but a large number of diverse barcode sequences are represented across the population of beads used. Typically, the population of beads provides a diverse barcode sequence library that includes at least 1000 different barcode sequences, at least 10,000 different barcode sequences, at least 100,000 different barcode sequences, or in some cases, at least 1,000,000 different barcode sequences. Additionally, each bead typically is provided with large numbers of oligonucleotide molecules attached. In particular, the number of molecules of oligonucleotides including the barcode sequence on an individual bead may be at least about 10,000 oligonucleotides, at least 100,000 oligonucleotide molecules, at least 1,000,000 oligonucleotide molecules, at least 100,000,000 oligonucleotide molecules, and in some cases at least 1 billion oligonucleotide molecules.

In some embodiments, the oligonucleotides are releasable from the beads upon the application of a particular stimulus to the beads. In some cases, the stimulus may be a photo-stimulus, e.g., through cleavage of a photo-labile linkage that may release the oligonucleotides. In some cases, a thermal stimulus may be used, where elevation of the temperature of the beads' environment may result in cleavage of a linkage or other release of the oligonucleotides from the beads. In some cases, a chemical stimulus may be used that cleaves a linkage of the oligonucleotides to the beads, or otherwise may result in release of the oligonucleotides from the beads.

In accordance with the methods and systems described herein, the beads including the attached oligonucleotides may be co-partitioned with the individual samples, such that a single bead and a single sample are contained within an individual partition. In some cases, where single bead partitions are desired, it may be desirable to control the relative flow rates of the fluids such that, on average, the partitions contain less than one bead per partition, in order to ensure that those partitions that are occupied are primarily singly occupied. Likewise, one may wish to control the flow rate to provide that a higher percentage of partitions are occupied, e.g., allowing for only a small percentage of unoccupied partitions. In preferred aspects, the flows and channel architectures are controlled so as to ensure a desired number of singly occupied partitions, less than a certain level of unoccupied partitions and less than a certain level of multiply occupied partitions.
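
The trade-off between singly occupied, unoccupied, and multiply occupied partitions can be illustrated numerically. The following minimal sketch assumes that the number of beads per partition follows a Poisson distribution with a given mean; this loading model is an assumption introduced here for illustration and is not stated in the present disclosure, and controlled flows and channel architectures may deviate from it. The function names are illustrative.

```python
import math
from typing import Dict


def occupancy(mean_beads: float) -> Dict[str, float]:
    """Fractions of empty, singly occupied, and multiply occupied
    partitions under an assumed Poisson loading model."""
    if mean_beads <= 0:
        raise ValueError("mean_beads must be positive")
    empty = math.exp(-mean_beads)   # P(0 beads)
    single = mean_beads * empty     # P(exactly 1 bead)
    return {"empty": empty, "single": single,
            "multiple": 1.0 - empty - single}


def single_fraction_of_occupied(mean_beads: float) -> float:
    """Among partitions holding at least one bead, the fraction that hold
    exactly one."""
    fractions = occupancy(mean_beads)
    return fractions["single"] / (1.0 - fractions["empty"])


if __name__ == "__main__":
    for mean in (0.1, 0.5, 1.0):
        print(mean, occupancy(mean), single_fraction_of_occupied(mean))
```

Under this assumed model, lowering the mean number of beads per partition raises the share of occupied partitions that are singly occupied, at the cost of leaving more partitions unoccupied, which is the balance the paragraph above describes.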

FIG. 3 of U.S. Patent Application No. 62/072,214, filed Oct. 29, 2014, entitled “Analysis of Nucleic Acid Sequences,” which is hereby incorporated by reference, and the portions of the specification therein describing FIG. 3, provide a detailed example of one method for barcoding and subsequently sequencing a test nucleic acid (referred to in the reference as a “sample nucleic acid”) in accordance with one embodiment of the present disclosure. As noted above, while single bead occupancy may be the most desired state, it will be appreciated that multiply occupied partitions, or unoccupied partitions, may often be present. FIG. 4 of U.S. Patent Application No. 62/072,214, filed Oct. 29, 2014, entitled “Analysis of Nucleic Acid Sequences,” which is hereby incorporated by reference, and the portions of the specification describing FIG. 4 therein, provide a detailed example of a microfluidic channel structure for co-partitioning samples and beads comprising barcode oligonucleotides in accordance with one embodiment of the present disclosure.

Once co-partitioned, the oligonucleotides disposed upon the beads may be used to barcode and amplify the partitioned samples. One process for use of these barcode oligonucleotides in amplifying and barcoding samples is described in detail in U.S. Patent Application Nos. 61/940,318, filed Feb. 7, 2014, 61/991,018, filed May 9, 2014, and U.S. patent application Ser. No. 14/316,383, (Attorney Docket No. 43487-708.201) filed on Jun. 26, 2014, the full disclosures of which are hereby incorporated by reference in their entireties. Briefly, in one aspect, the oligonucleotides present on the beads that are co-partitioned with the samples are released from their beads into the partition with the samples. Each oligonucleotide typically includes, along with the barcode sequence, a primer sequence at its 5′ end. This primer sequence may be a random oligonucleotide sequence intended to randomly prime numerous different regions of the samples, or it may be a specific primer sequence targeted to prime upstream of a specific targeted region of the sample.

Once released, the primer portion of the oligonucleotide can anneal to a complementary region of the sample. Extension reaction reagents, e.g., DNA polymerase, nucleoside triphosphates, co-factors (e.g., Mg²⁺ or Mn²⁺ etc.), that are also co-partitioned with the samples and beads, then extend the primer sequence using the sample as a template, to produce a complementary fragment to the strand of the template to which the primer annealed, with the complementary fragment including the oligonucleotide and its associated barcode sequence. Annealing and extension of multiple primers to different portions of the sample may result in a large pool of overlapping complementary fragments of the sample, each possessing its own barcode sequence indicative of the partition in which it was created. In some cases, these complementary fragments may themselves be used as a template primed by the oligonucleotides present in the partition to produce a complement of the complement that, again, includes the barcode sequence. In some cases, this replication process is configured such that when the first complement is duplicated, it produces two complementary sequences at or near its termini, to allow the formation of a hairpin structure or partial hairpin structure that reduces the ability of the molecule to be the basis for producing further iterative copies. A schematic illustration of one example of this is shown in FIG. 2.

As FIG. 2 shows, oligonucleotides 202 that include a barcode sequence 214 are co-partitioned in, e.g., a droplet 204 in an emulsion, along with a sample test nucleic acid fragment 206. In some embodiments, the oligonucleotides 202 are provided on a bead 208 that is co-partitioned with the test nucleic acid fragment 206, which oligonucleotides are preferably releasable from the bead 208, as shown in FIG. 2, panel (A). As shown in FIG. 2, panel (B), the oligonucleotides 202 include a barcode sequence 214, in addition to one or more functional sequences, e.g., sequences 210, 212 and 216. For example, oligonucleotide 202 is shown as further comprising sequence 212 that may function as an attachment or immobilization sequence for a given sequencing system, e.g., a P5 sequence used for attachment in flow cells of an ILLUMINA HISEQ or MISEQ system. In other words, attachment sequence 212 is used to reversibly attach oligonucleotide 202 to a bead 208 in some embodiments. As shown in FIG. 2, panel B, the oligonucleotide 202 also includes a primer sequence 216, which may include a random or targeted N-mer (discussed above) for priming replication of portions of the sample test nucleic acid fragment 206. Also included within exemplary oligonucleotide 202 of FIG. 2, panel B, is a sequence 210 which may provide a sequencing priming region, such as a “read1” or R1 priming region, that is used to prime polymerase mediated, template directed sequencing by synthesis reactions in sequencing systems. In many cases, the barcode sequence 214, immobilization (attachment) sequence 212 and exemplary R1 sequence 210 may be common to all of the oligonucleotides 202 attached to a given bead. The primer sequence 216 may vary for random N-mer primers, or may be common to the oligonucleotides on a given bead for certain targeted applications. FIGS. 3B through 3E and the specification describing these Figures in U.S. Prov. Application No. 62/113,693, entitled “Systems and Methods for Determining Structural Variation,” filed Feb. 9, 2014, detail how oligonucleotides 202 form sequencing reads of the sample test nucleic acid, where each such sequencing read includes a first portion that is a sequencing read of the sample test nucleic acid and a second portion that is the oligonucleotide 202. Such sequencing reads, and analysis of such sequencing reads, form the basis of the disclosed nucleic acid sequencing dataset 126.

In some embodiments, the sequencing reads in a nucleic acid sequencing dataset 126 are processed in order to sequence the at least one target nucleic acid. In some embodiments, conventional methods are used to process the nucleic acid sequence reads in order to establish a sequence for the at least one target nucleic acid. In some embodiments, the novel methods disclosed in PCT application PCT/US2015/038175, entitled “Processes and Systems for Nucleic Acid Sequence Assembly,” filed Jun. 26, 2015, which is hereby incorporated by reference, are used to process the nucleic acid sequence reads in order to establish a sequence for the at least one target nucleic acid. In some embodiments, such sequencing involves mapping the sequencing reads to a reference genome, such as the genome of the species from which the sample is taken. In some embodiments, the sample is expected to contain, or is suspected of containing, multiple genomes (e.g., the case in which a sample, such as a human sample, is infected with a retrovirus). In such cases, multiple reference genomes from different species may be used concurrently.

In some embodiments, the sequencing reads are processed by phasing them and by looking for structural variations. In some embodiments, conventional phasing methods and structural variation methods are used. In some embodiments, novel phasing methods and structural variation methods, such as those disclosed in United States Provisional Application No. 62/238,077, entitled “Systems and Method for Determining Structural Variation Using Probabilistic Models,” filed Oct. 6, 2015, which is hereby incorporated by reference, are used. Although not disclosed in this reference, in some embodiments the teachings of the reference are extended to incorporate multiple reference genomes in instances where the sample potentially contains nucleic acid from multiple reference genomes. For instance, in the case where the sample is human but it is possible that the sample is infected with a retrovirus, the genome of the retrovirus is treated as an additional chromosome. In this way, it is possible to extend the visualization methods disclosed in the present disclosure to identify insertion of nucleic acid constructs, such as retroviruses, into the genome of the sample under study.

So, for example, the disclosed techniques can use the bar codes to distinguish the following two scenarios. In the first scenario, a human sample contains HPV virus that is free floating in the sample but has not been inserted into the human DNA. The virus and the human DNA are separate molecules. In that case, the measured sequence reads will include reads that map to HPV as well as reads that map to the human genome, but the HPV reads and the human reads will not have bar codes in common, meaning that the human genome and the HPV are distinct. In the second scenario, the HPV molecule has been inserted into a human chromosome or two. In that case, what will be measured are sequence reads that map to both a human chromosome and the HPV and that share the same bar codes, meaning that they exist in the same molecule as opposed to separate molecules (e.g., the HPV has been incorporated into a human chromosome). Moreover, the bar codes can be used to localize the precise location(s) of the HPV insertion into the human chromosome.
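
By way of non-limiting illustration only, the bar code comparison described above can be sketched in Python as follows. The function name shared_barcodes, the reference labels, and the toy bar codes are hypothetical and do not appear in the dataset 126. The sketch assumes each read has already been mapped to a reference and ignores chance bar code collisions, which a practical implementation would likely need to model statistically.

```python
from collections import defaultdict


def shared_barcodes(reads, host="human", virus="HPV"):
    """Return the bar codes seen on reads mapping to both references.

    `reads` is an iterable of (barcode, reference_name) pairs.
    """
    refs_by_barcode = defaultdict(set)
    for barcode, reference in reads:
        refs_by_barcode[barcode].add(reference)
    # A bar code observed on both host and viral reads suggests that the
    # two sequences were present in the same partitioned molecule.
    return {
        bc for bc, refs in refs_by_barcode.items()
        if host in refs and virus in refs
    }


# Scenario 1: virus free floating, no bar codes in common.
free_floating = [("BC1", "human"), ("BC2", "human"), ("BC3", "HPV")]
# Scenario 2: virus integrated, bar code BC1 is shared.
integrated = [("BC1", "human"), ("BC1", "HPV"), ("BC2", "human")]
```

An empty result is consistent with the first scenario, while a non-empty result points to candidate molecules whose human-mapped reads could then be used to localize the insertion.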

FIG. 3 illustrates the data that is obtained from a biological sample of a subject (e.g., a particular human). This data is summarized in the form of a nucleic acid sequence dataset 126. In some instances, a full-genome run of the type described above produces 30-40 gigabytes worth of data. In accordance with some aspects of the present disclosure, such raw data is condensed into a nucleic acid sequence dataset 126 that is a fraction of the size of the raw data. In some embodiments, although the raw data is condensed to form the nucleic acid sequence dataset 126, the dataset 126 is still too large to load into the RAM of typical computers. For instance, in some embodiments, nucleic acid sequence dataset 126 is five gigabytes or larger, ten gigabytes or larger, or fifteen gigabytes or larger.

As illustrated in FIG. 3, the exemplary nucleic acid sequencing dataset 126 is organized into three parts: a header 302, a synopsis 308, and a data section 340. The purpose of the header 302 is to delineate the components 304 of the dataset 126 as well as, optionally, provide the version 306 of the dataset 126 structure, e.g., version 1.7. In some embodiments, the header 302 is formatted as a JSON structure to facilitate loading using web based applications such as a web browser. See the URL json.org, which is hereby incorporated by reference. For instance, in some embodiments, the header is formatted as a JSON object: beginning with { (left brace) and ending with } (right brace), with each name followed by : (colon) and the name/value pairs separated by , (comma). In one exemplary embodiment, the header 302 specifies that the sequencing dataset 126 has the components: fragment tracks (e.g., the length, position, barcode, and phase of all the fragments in the dataset), targets track (the regions of the genome selected by the capture protocol used during processing), structural variation track (lists of all the structural variants called in the sample), an index to a target dataset, vcf index (an index that relates ranges of the genome to a position in the dataset 126 file), marker, phase block summary (a description of the various phase blocks in the test nucleic acid 206), gene track (a description of all human genes, tagged with the number of SNPs in each gene), BAM data (associates ranges of the genome to the position in the file containing read information for that range), summary (high level metrics extracted from the sequencing data), and refseq index (an index that contains a list of dbSNP identifiers (RSIDs) of SNPs that are called in the sample, thereby associating the RSID with its position in the genome).
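
By way of non-limiting illustration only, a JSON header of the kind described above might be read as in the following Python sketch. The key names "version" and "index" and the component names in the example are assumptions made for illustration. The disclosure states only that the header is a JSON object listing the components 304 and, optionally, the version 306.

```python
import json

# Hypothetical header text; the key and component names are illustrative.
header_text = json.dumps({
    "version": "1.7",
    "index": [
        "fragments", "targets", "structural_variants", "vcf_index",
        "phase_blocks", "genes", "bam_data", "summary", "refseq_index",
    ],
})


def load_header(text):
    """Parse the header and return (version, list of component names)."""
    header = json.loads(text)
    if not isinstance(header, dict):
        raise ValueError("header must be a JSON object")
    # The version is optional, so a missing key yields None.
    return header.get("version"), list(header.get("index", []))


version, components = load_header(header_text)
```

Because the header is small and self-describing, a reader can decide which portions of the synopsis to load before touching the much larger data section.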

The synopsis section 308 contains data that is read by haplotype visualization tool 148 into volatile (e.g., random access) memory, typically in its entirety, when the dataset 126 is accessed. This data consists of indexes into the data section 340 as well as other data that is referenced frequently by visualization tool 148. As illustrated in FIG. 3, the synopsis section 308 is split up into several components which correspond to the “index” array (e.g., component list 304) in the header section 302.

Summary 310 provides high level metrics extracted from the data. In some embodiments, summary 310 is used by summarization module 150 to provide summary data such as that illustrated in FIGS. 12 and 13. This includes the percentage of known SNPs (e.g., human SNPs) phased 1202, the longest phase block 1204, the effective barcode count 1206 (e.g., the number of unique barcodes used in the dataset 126), average fragment length 1208, mean of average fragment length 1210, percentage of fragments greater than a lower threshold (e.g., 20 kb) 1212, fragment length histogram or other form of fragment length metric 1214, N50 phase block size 1216, phase block length histogram or other form of phase block length metric 1218, number of sequence reads represented by the dataset 1220, median insert size 1222, median depth 1224, percent of the target genome with zero coverage 1226, mapped reads percentage 1228, PCR duplication percentage 1230, on target bases (percent) 1232, coverage histogram or other form of coverage metric 1234, source of dataset in memory 112 (1234), identity of test nucleic acid (1236), genome source (1238), sex of donating organism (1240), dataset file format version 1242, and pointer to structural variant calls 1244 made for dataset 126.

Index to variant call data 312 is an example of an index found in the synopsis, and it relates respective ranges 314 of the genome of the target nucleic acid to offsets 316 in the corresponding data section 340 where variant call data for the respective ranges is found.

In some embodiments, the phase block track 318 is stored in the synopsis section 308 of the nucleic acid sequencing dataset 126. More details of the architecture of an exemplary phase block track 318 are found in FIG. 4. Referring to FIG. 4, in some embodiments, the phase block track 318 includes a dictionary section 402 and a track data section 408. The track data section comprises a plurality of records 410. In some embodiments, each record in the plurality of records comprises phase information for a corresponding chromosome. In some embodiments, each of the one or more data sections stores phase information for one or more corresponding chromosomes. In some embodiments, each of the one or more data sections stores phase information in an interval tree 422 format for a corresponding chromosome.

The dictionary 402 of the phase block track 318 comprises a plurality of names 404, and for each name 404, an offset 406 into the track data 408 where records for the corresponding name 404 are found. In some embodiments, the dictionary 402 for the phase block track 318 contains a single name, e.g., “phase_data”.

In some embodiments, the track data 408 is in JSON format. In some embodiments, each record 410 represents a phase block in the target nucleic acid. As such, in some embodiments, each record 410 specifies a chromosome number 412 that the phase block is on as well as the position where the phase block starts 414 on the chromosome 412 and a position where the phase block ends 416 on the chromosome 412. Moreover, there is a unique name 418 for each record and phasing information 420 about the phase block. In some embodiments, the purpose for the information 420 is to provide details of phasing information of the phase block. In some embodiments, a phase block includes information about two haplotypes corresponding to the two parents (e.g., respectively denoted haplotype “A” and haplotype “B”). Accordingly, in some embodiments, the phase information comprises PhaseASNP 422 (the number of counted single nucleotide polymorphisms on haplotype “A” in the phase block), UnphasedSNP 424 (the number of counted single nucleotide polymorphisms of unknown haplotype in the phase block) and PhaseBSNP (the number of counted single nucleotide polymorphisms on haplotype “B” in the phase block). As such, the track data 408 holds certain phase block data (e.g., SNP counts) for the nucleic acid sequencing dataset 126. Techniques for phasing genomic data and phase blocks are described in Browning and Browning, “Haplotype phasing: Existing methods and new developments,” Nat Rev Genet.; 12(10): 703-714. doi:10.1038/nrg3054, which is hereby incorporated by reference in its entirety.

In some embodiments, the track data 408 is put into context by corresponding interval trees 422. As such, each record 410 is represented by a node 424 in an interval tree 422. Each such interval tree 422 is a ternary tree with each node 424 of the tree storing a midpoint of the node x_(med) 432. This midpoint 432 is the position of the midpoint, on the corresponding chromosome, of the phase block corresponding to the node. Each respective node 424 has a link to a left child node 428, which corresponds to the phase block immediately to the left of the phase block represented by the respective node 424 in the genome of the species of the target (genetic source) organism. Each respective node 424 has a link to a right child node 430, which corresponds to the phase block immediately to the right of the phase block represented by the respective node 424. Each respective node 424 has a sorted set of nodes 425 that represent phase blocks that overlap the x_(med) 432 of the respective node 424, sorted by left hand position of such phase blocks. Each respective node 424 has a sorted set of nodes 436 that represent phase blocks that overlap the x_(med) 432 of the respective node 424, sorted by right hand position of such phase blocks. In some embodiments, sorted sets 425 and 436 are represented in a node 424 by arrays or linked lists. Each respective node 424 further includes a name 426, which is an offset in track data 408 to the record 410 that contains phase information 420 for the phase block corresponding to the respective node 424.

As illustrated in FIG. 4, in some embodiments, there is a separate interval tree 422 for each chromosome in the phase block track. Such interval trees advantageously provide a quick way of identifying all records 410 pertaining to a user specified region of the target genome. An example of a phase block track 318 is found in FIG. 5. In FIG. 5, exemplary elements that correspond to the data structure of FIG. 4 are illustrated.
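
By way of non-limiting illustration only, the following Python sketch shows how a per-chromosome interval tree with a midpoint, left and right children, and two sorted overlap sets can answer a region query. It is a conventional centered interval tree and is one possible reading of the structure described above, not necessarily the disclosed layout. The class name, the closed-coordinate convention, and the toy phase blocks are assumptions.

```python
class IntervalNode:
    """One node of a centered interval tree over (start, end, name) tuples."""

    def __init__(self, intervals):
        # Choose the median endpoint as the node midpoint (x_med).
        points = sorted(p for s, e, _ in intervals for p in (s, e))
        self.x_med = points[len(points) // 2]
        left = [iv for iv in intervals if iv[1] < self.x_med]
        right = [iv for iv in intervals if iv[0] > self.x_med]
        over = [iv for iv in intervals if iv[0] <= self.x_med <= iv[1]]
        # Intervals overlapping x_med, kept in two sort orders.
        self.by_start = sorted(over, key=lambda iv: iv[0])
        self.by_end = sorted(over, key=lambda iv: iv[1], reverse=True)
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def query(self, lo, hi):
        """Return every interval that overlaps the closed range [lo, hi]."""
        hits = []
        if hi < self.x_med:
            # Query lies left of x_med: scan by ascending start position.
            for iv in self.by_start:
                if iv[0] > hi:
                    break
                hits.append(iv)
        elif lo > self.x_med:
            # Query lies right of x_med: scan by descending end position.
            for iv in self.by_end:
                if iv[1] < lo:
                    break
                hits.append(iv)
        else:
            # Query contains x_med, so every stored interval overlaps it.
            hits.extend(self.by_start)
        if self.left is not None and lo < self.x_med:
            hits += self.left.query(lo, hi)
        if self.right is not None and hi > self.x_med:
            hits += self.right.query(lo, hi)
        return hits


# Toy phase blocks on one chromosome (coordinates are made up).
blocks = [(100, 500, "pb1"), (600, 900, "pb2"),
          (1000, 2000, "pb3"), (450, 650, "pb4")]
tree = IntervalNode(blocks)
```

Keeping the overlap set in both sort orders lets each scan stop at the first interval that cannot overlap the query, which is what makes lookups for a user specified region fast.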

Referring to FIG. 3, in some embodiments, the synopsis 308 further comprises a refseq index 319, which is an index that contains the molecular variation (e.g., SNP) identifiers that are called in the sample corresponding to the nucleic acid sequencing dataset. The refseq index 319 associates each such identifier with its position in the genome of the target organism. In some embodiments, the refseq index 319 is stored as a JSON data structure. In some embodiments, each polymorphism identifier in the refseq index 319 is a dbSNP identifier found in the National Center for Biotechnology Information (NCBI) database. See Wheeler et al., 2007, “Database resources of the National Center for Biotechnology Information,” Nucleic Acids Res. 35 (Database issue): D5-12, which is hereby incorporated by reference. Such dbSNP identifiers are termed reference SNP cluster IDs (RSIDs).

In some embodiments, the synopsis 308 further comprises a gene track 320, which provides a reference of human genes tagged with the number of SNPs found in each gene. More details of the architecture of an exemplary gene track 320 are found in FIG. 6. Referring to FIG. 6, in some embodiments, the gene track 320 includes a dictionary section 602, a track data section 608, and one or more data sections 628. In some embodiments, each of the one or more data sections stores gene information for a corresponding chromosome. In some embodiments, each of the one or more data sections stores gene information for one or more corresponding chromosomes. In some embodiments, each of the one or more data sections stores gene information in an interval tree 628 format for a corresponding chromosome.

The dictionary 602 of the gene track 320 comprises a plurality of names 604, and for each name 604, an offset 606 into the track data 608 where records for the corresponding name 604 are found. In some embodiments, each name 604 in dictionary 602 is the name of a chromosome in the target genome.

In some embodiments, the track data 608 for gene track 320 comprises a plurality of gene records 610. In some embodiments, the track data 608 is in JSON format. In some embodiments, each gene record 610 represents a gene in the species of the target nucleic acid. As such, in some embodiments, each gene record 610 specifies a chromosome number 612 the corresponding gene is on, the position where the gene starts 614 on the chromosome 612 and a position where the gene ends 616 on the chromosome 612. Moreover, there is a unique name 618 for each gene record and gene information 620 about the gene. In some embodiments, the purpose for the information 620 is to provide genetic information about the gene, such as, for example, an alternative name 622 for the gene, a count of single nucleotide polymorphisms 624 on the gene, and a direction (e.g., plus or minus) 626 of the gene.

In some embodiments, the track data 608 is put into context by the corresponding interval trees 628. Each gene record 610 forms a node 630 in an interval tree 628. Each interval tree 628 is a ternary tree with each node 630 storing a midpoint of the node x_(med) 642. This midpoint 642 is the position of the midpoint, on the corresponding chromosome, of the gene corresponding to the node. Each respective node 630 has a link to a left child node 638, which corresponds to the gene immediately to the left (lesser position on the chromosome) of the gene represented by the respective node 630 in the species of the target organism. Each respective node 630 has a link to a right child node 640, which corresponds to the gene immediately to the right (greater position on the chromosome) of the gene represented by the respective node 630 in the species of the target organism. Each respective node 630 has a sorted set of nodes 632 that respectively represent genes that overlap the x_(med) 642 of the respective node 630, sorted by left hand position. Each respective node 630 has a sorted set of nodes 644 that respectively represent genes that overlap the x_(med) 642 of the respective node 630, sorted by right hand position. In some embodiments, sorted sets 632 and 644 are represented in a node 630 by arrays or linked lists. Each respective node 630 further includes a name 636, which is an offset in track data 608 to the gene record 610 that contains genetic information 620 for the gene corresponding to the respective node 630.

As illustrated in FIG. 6, in some embodiments, there is a separate interval tree 628 for each chromosome in the gene track 320. Such interval trees advantageously provide a quick way of identifying all records 610 pertaining to a user specified region of the target genome. An example of a gene track 320 is found in FIG. 7. In FIG. 7, exemplary elements that correspond to the data structure of FIG. 6 are illustrated.

In some embodiments, the synopsis 308 further comprises an exon track 322. In some embodiments, the exon track 322 has the same architecture as the gene track 320, the exception being that whereas the gene track 320 represents genetic information for genes in the species of the target organism, the exon track 322 provides genetic information for exons in the species of the target organism.

In some embodiments, the synopsis 308 further comprises an index to read data 324. This index 324 provides an index into sequence/read data 1048 in the data section 340 of the nucleic acid sequencing dataset 126, which is described in more detail below with reference to FIG. 10. Referring to FIG. 3, the index 324 comprises a database (lookup table) which associates identifiers with the barcodes used in the dataset (not shown). This lookup table is a useful way to compress the size of read data 1048, because the identifiers can be used instead of the longer actual barcodes. This is possible because not all theoretically possible bar codes, for a given degree of information content, are used in a given dataset 126.
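
By way of non-limiting illustration only, a lookup table of this kind can be sketched in Python as follows. The function name build_barcode_table and the toy bar codes are hypothetical; the sketch simply assigns each distinct bar code the next unused integer identifier.

```python
def build_barcode_table(barcodes):
    """Assign a compact integer identifier to each distinct bar code.

    Returns (table, ids), where table[i] is the bar code with identifier i
    and ids maps each bar code back to its identifier.
    """
    table = []
    ids = {}
    for barcode in barcodes:
        if barcode not in ids:
            ids[barcode] = len(table)
            table.append(barcode)
    return table, ids


# Only the bar codes actually observed receive identifiers.
table, ids = build_barcode_table(["ACGT", "TTGA", "ACGT"])
```

Because only observed bar codes are enumerated, the identifier can be narrower than the bar code itself, and the original bar code is recovered by indexing into the table.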

The index 324 further comprises a per chromosome array of chromosome-offset-->file-offset associations 328 into read data 1048 as well as a length of each such data element, which allow lookup of the corresponding data for a specific genomic range. In some embodiments, the read data is stored as a blocked index, and each record 328 is a fixed bit record for each entry in a BAM file that was incorporated into the dataset 126. Each such entry in the BAM file is organized into chunks within the data section 340 of the file. The index 324 in the synopsis 308 helps to find the correct chunk within the data section 340 to read. Referring to FIG. 10, the corresponding architecture of the sequence/read data 1048 indexed by index 324 is disclosed. For each chromosome, read data 1048 is stored in chunks 1050. In some embodiments, each data chunk 1050 is an array of 64-bit structures 1052 in the following format:

where O is always 0, X indicates the read quality is below a threshold value (e.g., below 60), L indicates the read is from parental haplotype A, R indicates the read is from parental haplotype B, I is a numerical identifier corresponding to the barcode in the read, E is the ‘end’ length of the read, and S is the ‘start’ position of this read, relative to the start of the chunk 1050. More generally, referring to FIG. 10, each structure 1052 corresponds to a single read from the target nucleic acid for the single organism of a species and comprises a start (offset), a length, an indicator to a bar code and some flags. In some embodiments, the start within structure 1052 is the real position on the chromosome minus the start value stored for the chunk 1050 in the chromosome offset field of record 328 of index 324. Advantageously, this avoids repeating large genomic coordinates in the structures 1052. Such coordinates can be in the billions and thus would require 30 bits to store. Advantageously, by chunking, as disclosed in sequence/read data 1048, each chunk covers up to about one million base pairs and thus each start (offset) in each structure 1052 in a chunk only needs 20 bits, since the range for any given chunk is specified by the chromosome offset/length portions of the corresponding record 328 in the index 324 stored in the synopsis 308. Similarly, as outlined above, in preferred embodiments, the barcode field in structure 1052 does not store the actual barcode. In some embodiments, the barcode indicator in structure 1052 is a 24-bit index into a barcode table that is stored in the index 324. So, when the actual barcode associated with a particular read is needed, the structure 1052 corresponding to the read is accessed, and the 24-bit bar code indicator in the structure 1052 is queried against the barcode table in the index 324 to obtain the barcode. In this way, 30 bit bar codes in the structures 1052 are avoided. In some embodiments, the bar code is greater than 30 bits (e.g., 32 bits, 34 bits, 36 bits or larger) and the indicator to the bar code in structure 1052 is greater than 20 bits (e.g., 22 bits, 24 bits, 26 bits or larger). In some embodiments, the bar code is less than 30 bits (e.g., 28 bits, 26 bits, 24 bits or smaller) and the indicator to the bar code in structure 1052 is less than 20 bits (e.g., 18 bits, 16 bits, 14 bits or smaller). In some embodiments, each data chunk 1050 is an array of structures 1052 having the same predetermined size (e.g., 128 bits, 64 bits, 32 bits, or some other fixed bit size).
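
By way of non-limiting illustration only, a 64-bit read structure of this kind can be packed and unpacked as in the following Python sketch. The text above gives 20 bits for the start offset and 24 bits for the bar code indicator. The 16-bit length field, the order of the fields within the word, and the little-endian byte order are assumptions chosen so that the fields total 64 bits, and the function names are hypothetical.

```python
import struct

# Assumed layout, most significant bit first:
# O(1) X(1) L(1) R(1) I(24) E(16) S(20) = 64 bits.
START_BITS, LENGTH_BITS, BARCODE_BITS = 20, 16, 24


def pack_read(start, length, barcode_id,
              low_quality=False, hap_a=False, hap_b=False):
    """Pack one read into 8 bytes; `start` is relative to the chunk start."""
    for value, bits in ((start, START_BITS), (length, LENGTH_BITS),
                        (barcode_id, BARCODE_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field does not fit in its bit width")
    word = (int(low_quality) << 62) | (int(hap_a) << 61) | (int(hap_b) << 60)
    word |= barcode_id << (START_BITS + LENGTH_BITS)
    word |= length << START_BITS
    word |= start
    return struct.pack("<Q", word)  # top bit (O) is always left at zero


def unpack_read(raw):
    """Inverse of pack_read; returns a dict of the decoded fields."""
    (word,) = struct.unpack("<Q", raw)
    return {
        "start": word & ((1 << START_BITS) - 1),
        "length": (word >> START_BITS) & ((1 << LENGTH_BITS) - 1),
        "barcode_id": (word >> 36) & ((1 << BARCODE_BITS) - 1),
        "hap_b": bool((word >> 60) & 1),
        "hap_a": bool((word >> 61) & 1),
        "low_quality": bool((word >> 62) & 1),
    }


raw = pack_read(start=123456, length=150, barcode_id=70000, hap_a=True)
record = unpack_read(raw)
```

The absolute chromosome position of a read would then be recovered by adding the decoded start to the chromosome offset stored for the chunk in the index, and the bar code by looking up the decoded identifier in the bar code table.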

In some embodiments, the synopsis 308 further comprises a structural variant dataset track 330. In some embodiments, the structural variant dataset track 330 comprises a listing of the called structural variants in the sample represented by the dataset 126. More details of the architecture of an exemplary structural variant dataset track 330 are found in FIG. 8. Referring to FIG. 8, in some embodiments, the structural variant dataset track 330 includes a dictionary section 802, a track data section 808, and one or more data sections 840. In some embodiments, each of the one or more data sections 840 stores structural variant call information for a corresponding chromosome. In some embodiments, each of the one or more data sections 840 stores structural variant call information for one or more corresponding chromosomes. In some embodiments, each of the one or more data sections 840 stores structural variant call information in an interval tree format for a corresponding chromosome.

The dictionary 802 of the structural variant dataset track 330 comprises a plurality of names 804, and for each name 804, an offset 806 into the track data 808 where records for the corresponding name 804 are found. In some embodiments, each name 804 in dictionary 802 is the name of a chromosome in the target genome.

In some embodiments, the track data 808 for structural variant dataset track 330 comprises a plurality of structural variant records 810. In some embodiments, the track data 808 is in JSON format. In some embodiments, each structural variant record 810 represents a structural variant call made for the target nucleic acid of the single organism represented by the dataset 126. As such, in some embodiments, each structural variant record 810 specifies a chromosome number 812, a start position 814 represented by the structural variation, a stop position 816 represented by the structural variation on the chromosome 812, a unique name 818 for the structural variation, and information 820 about the structural variation. In some embodiments, the structural variant dataset track 330 includes information that is analogous to, corresponds to, or is in a BEDPE format, which advantageously describes disjoint genome features, such as structural variations or paired-end sequence alignments, concisely. See the URL bedtools.readthedocs.org/en/latest/content/general-usage.html, which is hereby incorporated herein by reference. Accordingly, in some embodiments, the information section 820 in each structural variant record 810 includes a chromosome 1 name 822, which is the name of the chromosome on which the first end of the feature exists. In some embodiments, chromosome 1 name 822 is in string format, for example, “chr1”, “III”, “myChrom”, or “contig1112.23.”

In some embodiments, the information section 820 in each record 810 further comprises a start 1 position 830, which is a zero-based starting position of the first end of the feature on chromosome 1 name 822.

In some embodiments, the information section 820 in each record 810 further comprises stop 1 (end 1) position 826, which is the one-based ending position of the first end of the feature (e.g., structural variation) represented by record 810 on chromosome 1 name 822.

In some embodiments, the information section 820 in each record 810 further comprises chromosome 2 name 836, which is the name of the chromosome on which the second end of the feature represented by record 810 exists. In some embodiments, chromosome 2 name 836 is in string format, for example, “chr1”, “III”, “myChrom”, or “contig1112.23.”

In some embodiments, the information section 820 in each record 810 further comprises a start 2 position 828, which is the zero-based starting position of the second end of the feature represented by record 810 on chromosome 2 name 836.

In some embodiments, the information section 820 in each record 810 further comprises a stop 2 (end 2) position 824, which is the one-based ending position of the second end of the feature (e.g., structural variation) represented by record 810 on chromosome 2 name 836.

In some embodiments, the information section 820 in each record 810 further comprises a name of the structural variant field 834, which is the name of the feature (e.g., structural variation) represented by record 810. In some embodiments, the name of the structural variant 834 is in string format, for example, “LINE”, “Exon3”, “HWIEAS_0001:3:1:0:266#0/1”, or “my_Feature”.

In some embodiments, the information section 820 in each record 810 further comprises a quality (score) field 832, which is any metric that scores the quality of the feature (e.g., structural variation) represented by record 810. In some embodiments, quality 832 is in string format, thereby permitting the expression of quality of the feature in any scientific metric, e.g., p-values, mean enrichment values, etc.

In some embodiments, the information section 820 in each record 810 further comprises further information 838 on the feature represented by the record 810 (such as edit distance for each end of an alignment, or “deletion”, “inversion”, etc.).
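
By way of non-limiting illustration only, the fields enumerated above follow the column order of the BEDPE format, so a tab-delimited BEDPE line can be parsed as in the following Python sketch. The class and function names and the example line are hypothetical, and the sketch makes no claim about how records 810 are actually serialized in the dataset 126.

```python
from dataclasses import dataclass


@dataclass
class BedpeRecord:
    chrom1: str
    start1: int   # zero-based start of the first end
    end1: int     # one-based end of the first end
    chrom2: str
    start2: int   # zero-based start of the second end
    end2: int     # one-based end of the second end
    name: str
    score: str    # kept as a string so any metric can be expressed
    extra: tuple  # any trailing fields, e.g. strands or "deletion"


def parse_bedpe_line(line):
    """Parse one tab-delimited BEDPE line into a BedpeRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-delimited fields")
    return BedpeRecord(
        fields[0], int(fields[1]), int(fields[2]),
        fields[3], int(fields[4]), int(fields[5]),
        fields[6], fields[7], tuple(fields[8:]),
    )


sv = parse_bedpe_line(
    "chr1\t100\t200\tchr5\t5000\t5100\tmy_Feature\t0.01\t+\t-\tdeletion"
)
```

With a zero-based start and a one-based end, the length of each end of the feature is simply the end minus the start.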

Continuing to refer to FIG. 8, in some embodiments, the track data 808 is put into context by the corresponding interval trees 840. Each record 810 forms a node 842 in an interval tree 840. Each interval tree 840 is a ternary tree with each node 842 storing a midpoint of the node x_(med) 852. This midpoint 852 is the position of the midpoint, on the corresponding chromosome, of the feature (e.g., structural variant) corresponding to the node and represented by the corresponding record 810. Each respective node 842 has a link to a left child node 848, which corresponds to the feature (e.g., structural variant) immediately to the left (lesser position on the chromosome) of the feature represented by the respective node 842 in the dataset 126. Each respective node 842 has a link to a right child node 850, which corresponds to the feature (e.g., structural variant) immediately to the right (greater position on the chromosome) of the feature represented by the respective node 842 in the dataset 126. Each respective node 842 has a sorted set of nodes 854 that respectively represent features (e.g., structural variants) that overlap the x_(med) 852 of the respective node 842, sorted by left hand position. Each respective node 842 has a sorted set of nodes 844 that respectively represent features that overlap the x_(med) 852 of the respective node 842, sorted by right hand position. In some embodiments, sorted sets 844 and 854 are represented in a node 842 by arrays or linked lists. Each respective node 842 further includes a name 846, which is an offset in track data 808 to the record 810 that contains information 820 for the feature (e.g., structural variation) corresponding to the respective node 842.

As illustrated in FIG. 8, in some embodiments, there is a separate interval tree 840 for each chromosome in the structural variant dataset track 330. Such interval trees advantageously provide a quick way of identifying all records 810 pertaining to a user specified region of the target genome. An example of a portion of a structural variant dataset track 330 is found in FIG. 9. In FIG. 9, exemplary elements that correspond to the data structure of FIG. 8 are illustrated.

Referring to FIG. 3, in some embodiments, the synopsis 308 further comprises an index 332 to the target dataset 342. The target dataset 342 comprises the regions of the at least one target nucleic acid in the sample that were selected for sequencing in the nucleic acid sequencing dataset. In some embodiments, index 332 and target dataset 342 are stored in a blocked JSON index. The blocked JSON index includes a single JSON object in the synopsis section (the index 332) and multiple JSON objects in the data section (the target dataset 342). The index 332 is used to calculate which data components must be read to fulfill a particular query. In some embodiments, the index 332 is split up by chromosome. For each chromosome, the index 332 stores an array (record) 334 associating ranges on that chromosome with the offset at which specific data for that range may be found in the target dataset. In some embodiments, the target dataset 342 contains many independent arrays. Each array contains all of the ranges (and associated data) for one contiguous range of the genome. Each array in the target dataset 342 corresponds to a single array (entry) 334 in the index 332. In some embodiments, each such array in the target dataset is sized to contain about 1,000 entries. Because it is possible for a specific range to overlap multiple “chunks”, the same data may be written into multiple consecutive arrays.

Referring to FIG. 3, in some embodiments, the synopsis 308 further comprises an index 336 to the fragment dataset 344. The fragment dataset 344 comprises the length, position, barcode, and phase of all the fragments in the nucleic acid sequencing dataset. A fragment is the nucleic acid from a single partition, as described above. In some embodiments, index 336 and fragment dataset 344 are stored in a blocked JSON index. The blocked JSON index includes a single JSON object in the synopsis section (the index 336) and multiple JSON objects in the data section (the fragment dataset 344). The index 336 is used to calculate which data components must be read to fulfill a particular query. In some embodiments, the index 336 is split up by chromosome. For each chromosome, the index 336 stores an array 338 associating ranges on that chromosome with the offset at which specific data for that range may be found in the fragment dataset 344. An example of a data chunk in the fragment dataset 344 is:

{
  "Chromosome" : "chr1",
  "Name" : "19002",
  "Info" : {
    "h0" : "0.100000017888",
    "h1" : "0.899999982112",
    "hmix" : "0.0\n",
    "phase_set" : "107163622",
    "ps_start" : "7163622",
    "be" : "CGTICCGTGGTATA-1",
    "ps_end" : "7276533"
  },
  "Stop" : 7235518,
  "Start" : 7213929
}
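
By way of non-limiting illustration only, the manner in which an index such as index 336 may be used to calculate which data chunks must be read for a query can be sketched as follows. The in-memory layout of the index (sorted tuples of range start, range end, and offset), the use of half-open ranges, and the names `INDEX` and `chunks_for_query` are assumptions made for this sketch and are not taken from the disclosed file format.

```python
from bisect import bisect_right

# Hypothetical in-memory form of a per-chromosome index: a list of
# (range_start, range_end, offset) tuples sorted by range_start.
# Half-open ranges and the offset values are assumptions for illustration.
INDEX = {
    "chr1": [
        (0, 5_000_000, 0),
        (5_000_000, 10_000_000, 48_213),
        (10_000_000, 15_000_000, 97_105),
    ],
}


def chunks_for_query(index, chrom, start, end):
    """Return the offsets of every chunk that overlaps [start, end)."""
    entries = index.get(chrom, [])
    starts = [entry[0] for entry in entries]
    # Entries that begin at or after the query end cannot overlap it.
    hi = bisect_right(starts, end - 1)
    # Of the remaining entries, keep those that end after the query start.
    return [offset for (_, range_end, offset) in entries[:hi] if range_end > start]


# The fragment in the example chunk above spans 7213929 to 7235518 on chr1.
offsets = chunks_for_query(INDEX, "chr1", 7_213_929, 7_235_518)
```

Only the chunks at the returned offsets would then need to be read from the data section; a query that spans several ranges simply returns several offsets.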

Thus, as the above provides, the disclosed nucleic acid sequencing datasets 126 of the present disclosure provide a streamlined file format that combines several forms of data that are conventionally found in separate files along with data that is of only secondary value. Advantageously, the nucleic acid sequencing dataset 126 file format is self-contained and has all the data required to support the features of haplotype visualization tool 148.

FIGS. 12-30 illustrate an embodiment of the haplotype visualization tool 148 that reads nucleic acid sequencing datasets 126. In some embodiments, the haplotype visualization tool 148 is a variant oriented and haplotype aware genome browser. To produce such views, the haplotype visualization tool 148 overlays data from several sources as tracks into a single unified nucleic acid sequencing dataset 126 for display that can be scrolled and zoomed. In some embodiments, the tracks that are stored include phased variant calls, phase blocks, genes, exons, structural variant breakpoints, and read count (coverage). One such embodiment for how such information is stored is disclosed in FIG. 3 and described above. Advantageously, the disparate information in the nucleic acid sequencing dataset can be displayed in a single display. The haplotype visualization tool 148 is distinguished from other genome browsers by its ability to show phasing information. Referring to FIGS. 12 and 13, from the summarization module displayed therein, a user can advantageously use the search prompt 1250 to select regions of the nucleic acid sequencing dataset for further analysis. In some embodiments, through search prompt 1250, the haplotype visualization tool 148 supports a broad range of valid search syntaxes such as chr1:1000000 (select the first million nucleotides of chromosome 1), chr1:1000000-2000000 (select the second million nucleotides of chromosome 1), BRCA1, BRCA2 (select BRCA1 and BRCA2), and chr1:1000000-2000000, chr2:5000000-6000000 (select the second million nucleotides of chromosome 1 and the sixth million nucleotides of chromosome 2). 
In some embodiments, the user provides a symbolic name of a gene and the haplotype visualization tool 148 converts this symbolic name to the appropriate genomic coordinates by using one or more lookup tables that convert symbolic names to genomic coordinates. Advantageously, a user can provide in a single search a mix of absolute coordinate ranges and gene names. In some embodiments, a user provides a single search query that includes multiple loci. Responsive to such a query, the haplotype visualization tool 148 parses the multiple loci and provides results for each such locus. In some embodiments, the user provides a search query of syntax X₁:N₁-N₂, where X₁ is an identity of a selected first chromosome or a selected first contig sequence, N₁ is a selected start position within the first chromosome or the selected first contig sequence, and N₂ is a selected end position within the first chromosome or the selected first contig sequence. As used in this context, the term “contig” means any “contig” from a reference genome, which could correspond to an isolated molecule of interest that is not a chromosome or to an incompletely assembled part of a chromosome. In some embodiments, the user provides a search query of syntax X₁:N₁-N₂, where X₁ is an identity within a selected first chromosome or a selected first contig sequence, N₁ is a selected start position within the first chromosome or the selected first contig sequence, and N₂ is a selected end position within the first chromosome or the selected first contig sequence. In some embodiments, the user provides a search query of syntax X₁:N₁, where X₁ is an identity of a selected first chromosome or a selected first contig sequence, and N₁ is a number of nucleotides, beginning at the origin of the first chromosome or the selected first contig sequence.

In some embodiments, a user provides a search query of syntax Y₁, Y₂, . . . , Y_(N), where each Y_(i) in Y₁, Y₂, . . . , Y_(N) is either an alphanumeric identification of a selected gene, a selection of a chromosomal region, or a selection of a region of a contig sequence. In some such embodiments, a first Y_(i) in Y₁, Y₂, . . . , Y_(N) is an identity of a first chromosome or a first contig sequence having the syntax X₁:N₁-N₂, where X₁ is an identity of the first chromosome or the first contig sequence, N₁ is a selected start position within the first chromosome or the first contig sequence, and N₂ is a selected end position within the first chromosome or the first contig sequence, and a second Y_(i) in Y₁, Y₂, . . . , Y_(N) is an alphanumeric identification of a selected gene. In some embodiments, the request is converted, without human intervention, to genomic coordinates by comparison of the request against one or more lookup tables that match alphanumeric entries of genes to genomic coordinates. In some embodiments, the request comprises one or more gene names, one or more genomic coordinates, or a combination thereof.
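
By way of non-limiting illustration only, a parser for the search syntaxes described above (X₁:N₁, X₁:N₁-N₂, gene names, and comma-separated mixtures thereof) might be sketched as follows. The lookup table `GENE_TABLE`, its placeholder entry, the function name `parse_query`, and the choice of zero as the origin of a chromosome or contig are all assumptions made for this sketch.

```python
import re

# Hypothetical lookup table from symbolic gene names to genomic coordinates.
# The entry below is a placeholder, not a real annotation.
GENE_TABLE = {"GENE_A": ("chr1", 100, 200)}

# Matches X1:N1 and X1:N1-N2, where X1 is a chromosome or contig identity.
LOCUS = re.compile(r"^(?P<contig>[^:\s]+):(?P<n1>\d+)(?:-(?P<n2>\d+))?$")


def parse_query(query, gene_table=GENE_TABLE):
    """Convert a search query into a list of (contig, start, end) regions."""
    regions = []
    for term in (t.strip() for t in query.split(",")):
        if not term:
            continue
        match = LOCUS.match(term)
        if match:
            n1 = int(match.group("n1"))
            if match.group("n2") is None:
                # X1:N1 selects N1 nucleotides beginning at the origin
                # (origin assumed to be position 0 in this sketch).
                regions.append((match.group("contig"), 0, n1))
            else:
                regions.append((match.group("contig"), n1, int(match.group("n2"))))
        elif term in gene_table:
            # Symbolic gene name converted through the lookup table.
            regions.append(gene_table[term])
        else:
            raise ValueError(f"unrecognized search term: {term!r}")
    return regions
```

For example, `parse_query("chr1:1000000-2000000, GENE_A")` yields one coordinate region and one region obtained from the lookup table, reflecting a single search that mixes absolute coordinate ranges and gene names.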

Advantageously, the haplotype visualization tool 148 can be invoked in a variety of different system topologies. For instance, referring to FIG. 31, in some embodiments, the haplotype visualization tool 148 operates on a client computer 3102 and accesses the nucleic acid sequence dataset remotely, communicating with the structural variation and phasing visualization system 100 across a network connection 3106. One such embodiment of the present disclosure provides a system 3100 for providing structural variation or phasing information over a network connection to a remote client computer 3102. Referring to FIGS. 1 and 32, the system 3100 comprises a server 100 having one or more microprocessors 102, a persistent memory (e.g., hard drive) and a non-persistent memory (e.g., random access memory). One of skill in the art will appreciate that persistent memory is memory that stores information even when system 100 is powered down whereas non-persistent memory is not able to store information when system 100 is powered down. Moreover, one of skill in the art will appreciate that access to data stored in persistent memory is slower than access to data stored in non-persistent memory. Further still, non-persistent memory is more expensive than persistent memory. As such, the disclosed nucleic acid datasets 126, which are large, are typically relegated to storage in persistent memory. In some embodiments, a nucleic acid sequencing dataset is 1 gigabyte or larger, 5 gigabytes or larger, or 10 gigabytes or larger.

In some embodiments, the persistent memory and the non-persistent memory, collectively referenced as memory 112 in FIG. 1, store one or more nucleic acid sequence datasets 126. Each respective nucleic acid sequencing dataset 126 in the one or more nucleic acid sequence datasets corresponds to at least one target nucleic acid in a respective sample in a plurality of samples. The respective sample is associated with a genome of a species. Referring to FIG. 3, the respective nucleic acid sequencing dataset 126 comprises (i) a header 302, (ii) a synopsis 308, and (iii) a data section 340.

The data section 340 comprises a plurality of sequencing reads and is the largest component of the dataset 126. Each respective sequencing read in the plurality of sequencing reads comprises a first portion that corresponds to a subset of at least one target nucleic acid in the respective sample and a second portion that encodes a respective identifier for the respective sequencing read in a plurality of identifiers. Each respective identifier is independent of the sequence of the at least one target nucleic acid. The plurality of sequencing reads collectively includes the plurality of identifiers.

The persistent memory and the non-persistent memory further collectively store one or more programs that use the one or more microprocessors 102 to provide a haplotype visualization tool 148 to the client for installation on the remote client computer. In turn, a request, sent from the client over the network connection, is received for structural variation or phasing information using a first dataset 126 in the one or more datasets. Responsive to receiving the request, the request is automatically filtered by loading the header 302 and the synopsis 308 of the first dataset into the non-persistent memory, if not already loaded into the non-persistent memory, while retaining the data section 340 in persistent memory. In this way, the amount of non-persistent memory that is used is minimized. The request is compared to the synopsis 308 of the first dataset, thereby identifying one or more portions of the data section of the first dataset. In particular, the various components of the synopsis 308 are used to identify which portions of the data section 340 are needed to fulfill the request. In some embodiments, the request identifies a particular dataset 126 and a region of a genome. In some embodiments, the request identifies a particular dataset 126 and one or more genes. In some embodiments, the request identifies a particular dataset 126 and one or more exons. Once the portions of the data section that are needed to fulfill the request are identified, they are loaded into non-persistent memory and the requested structural variation or phasing information is formatted for display on the client computer 3102 using the first dataset. This formatted structural variation or phasing information is then sent over the network connection 3106 to the client device for display on the client device. In some embodiments, as disclosed in FIG. 1, a client computer is not used and the haplotype visualization tool is resident on the structural variation and phasing visualization system 100.

Now that advantages of splitting up the nucleic acid sequence dataset 126 have been explained, graphical user interface features of the haplotype visualization tool 148, and its component modules (e.g., summarization module 150, phase visualization module 152, structural variations module 154, etc.) will be described in further detail. Turning to FIG. 12, once a user has entered a query in panel 1250, phase visualization module 152 may be used to view the phase of the query as illustrated in FIGS. 14 through 16. For instance, upon entering the query chr1+10000000−chr1+10500000 (or chr1:10000000-chr1:10500000), the selected region is illustrated in the genome browser (phase visualization module 152) illustrated in FIG. 14A. Here, the selected region of the genome is advantageously shown in a way that reflects the actual physical structure of the selected region: there are two copies of the genome, and this is reflected by showing two tracks, one for each haplotype, namely haplotype 1 (1402) and haplotype 2 (1404), and a middle area 1406 where the parental haplotype has not been determined. Small insertions and deletions are mapped to each haplotype based on phasing algorithms. Portions of the selected region that have been phased to the first haplotype are shown as bars in the corresponding portion of the haplotype 1 region 1402, portions of the selected region that have been phased to the second haplotype are shown as bars in the corresponding portion of the haplotype 2 region 1404, and portions of the selected region that have not been phased to a haplotype are shown as bars in the middle area 1406.

In the haplotype view, phased portions of the selected region are enclosed in black rectangular boxes 1440. The entire region illustrated in FIG. 14A is in a single phase block 1440-1. This is also the case for FIG. 14B, FIG. 15, and chromosomes 1 and 2 of FIG. 16. However, the displayed region of chromosome 4 in FIG. 16 includes five different phase blocks, each demarcated by a black rectangular box. Each such box demarcates a phase block, that is, a contiguous phased region of the chromosome as determined by phasing algorithms.

Vertical bars in the haplotype 1 (1402), haplotype 2 (1404), and middle area 1406 represent single nucleotide polymorphisms, small insertions and deletions. In some embodiments, these bars are color coded with a first color (e.g., grey) representing the reference genotype, and a second color (e.g., green) representing the alternative genotype.

A homozygous SNP will have a vertical bar spanning the two haplotype tracks and the middle area (unphased track) since homozygous variants cannot be phased. This is illustrated as element 2602 in FIG. 26.

Phased heterozygous SNPs are placed on the haplotype tracks 1402/1404. This is illustrated as element 2604 in FIG. 26.

Heterozygous SNPs are placed in the middle area 1406 (unphased track) sandwiched in between the haplotype tracks 1402/1404 when they are not phased. This is illustrated as element 2606 in FIG. 26.

Finally, if both phased single nucleotide polymorphisms are of alternative genotype, two vertical bars of the second color (e.g., green) will be displayed in the haplotype tracks 1402/1404, one for each track. This is illustrated as element 2608 in FIG. 26.
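
By way of non-limiting illustration only, the placement rules described above for elements 2602 through 2608 can be summarized in a short sketch. The allele encoding (0 for the reference genotype, any other value for an alternative genotype), the track labels, the function names, and the color chosen for an unphased heterozygous variant are assumptions made for this sketch.

```python
REF_COLOR, ALT_COLOR = "grey", "green"  # example first and second colors


def allele_color(allele):
    """0 denotes the reference genotype; anything else is an alternative."""
    return REF_COLOR if allele == 0 else ALT_COLOR


def place_variant(allele1, allele2, phased):
    """Return (track, color) pairs describing where bars are drawn."""
    if allele1 == allele2:
        # Homozygous: one bar spanning both haplotype tracks and the
        # unphased middle area, since homozygous variants cannot be phased.
        return [("all", allele_color(allele1))]
    if phased:
        # Phased heterozygous: one bar on each haplotype track.
        return [
            ("haplotype1", allele_color(allele1)),
            ("haplotype2", allele_color(allele2)),
        ]
    # Unphased heterozygous: middle area only. The color used here is an
    # assumption; the description above does not specify it.
    return [("unphased", ALT_COLOR)]
```

Under these assumptions, a phased variant whose two alleles are both alternative genotypes, such as `place_variant(1, 2, True)`, produces two bars of the second color, one on each haplotype track.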

Dark regions of the haplotype track, such as region 2710 of FIG. 27, represent areas with high SNP density. Clicking on a region 2710 zooms into individual SNPs within the region 2710. Furthermore, in some embodiments, when this is done, a pop-up box 2712 will appear with a link allowing the user to zoom in on the SNP group. In general, the box 2712 provides additional information on the SNP, such as position, the reference genotype, observed genotypes of haplotype 1 and 2 in the sample, the gene in which the SNP is found (if associated with a gene), phasing quality, and allele counts of the two observed genotypes. The box 2712 can be dismissed by clicking on an X on a corner of the box. In some embodiments, the phasing quality provided for the SNP is a Phred-like score used to quantify the phasing quality of a SNP.

Referring to FIG. 28A, when a user clicks on one of the alleles for a variant, a rectangular box (e.g., rectangular box 2802) highlights that variant. The number 2804 displayed next to the highlighted variant represents the number of barcodes that are associated with the selected allele for that variant. For instance, in FIG. 28A, the number “31” is displayed next to box 2802, indicating that the number of barcodes that are associated with the selected allele for that variant is 31. There are also numbers displayed on the top and/or bottom of variants adjacent to box 2802. Each such number represents the number of barcodes that overlap between the selected allele and one of the two alleles of the adjacent variants. Numbers displayed in a first color (e.g., black) agree with the phasing call of the variant 2802, while numbers displayed in a second color (e.g., red) disagree with the call. The greater the barcode overlap there is between neighboring variants, the more confidence there is in the phasing of the variant. As an example, for the reference call at Chr7: 117,216,030 of FIG. 28A, there is a 31 (2804) on the top of the haplotype 1 panel 1402, indicating there are 31 barcodes associated with the reference allele at that position. Referring to FIG. 28B, when the variant SNV at the same position 2802 is selected, 13 barcodes support the phasing and the labeled neighboring SNVs change as seen in FIG. 28B.

In some embodiments, the genome browser further provides a chromosome map 1424 and the location 1426 on the chromosome that is being displayed. Referring to FIG. 14A, at the top of the browser, a miniature chromosome 1424 with the centromere marked by a dark rectangle is shown with chromosome bands marked by light rectangles. A triangle 1426 indicates the location currently in zoom, giving the user an overall view of the region selected using search bar 1250 with respect to the rest of the chromosome.

The disclosed genome browser further provides a graphic representation 1408 of each gene that is in the displayed genomic region. This genes track 1408 displays annotated reference genes. Multiple genes can be displayed using the search bar 1250 by entering the genes of interest. The direction of each gene is indicated with arrows. Although not illustrated in FIG. 14A, exons are highlighted with dark shades. This feature is illustrated in FIGS. 26-28. In some embodiments, overlapping genes are shown on a maximum of three tracks in the genes track 1408, but many genes may be displayed using the search bar.

The disclosed genome browser further provides a graphic representation 1410 of exons that are in the displayed genomic region.

The disclosed genome browser further provides a coverage track 1412 for the coverage in the displayed genomic region. Aligned sequence reads are shown on the coverage track. Each vertical bar in the coverage track 1412 shows the average coverage-per-base for the area of the genome under the bar. The height is scaled such that the maximum height corresponds to four times the median coverage. In some embodiments, when a user clicks on a portion of the coverage track 1412, the mean reads per base pair and total number of reads are displayed in a black coverage details pop-up box for that portion of the coverage track.
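
By way of non-limiting illustration only, the scaling of bar heights in a coverage track such as coverage track 1412 might be sketched as follows. The function name `bar_heights`, the pixel height of the track, and the clipping of bars whose coverage exceeds four times the median are assumptions made for this sketch.

```python
from statistics import median


def bar_heights(coverage, track_height_px=40):
    """Scale per-bar average coverage so 4x the median fills the track.

    coverage: average coverage-per-base for the area under each bar.
    Bars above 4x the median are clipped to full height (an assumption).
    """
    cap = 4 * median(coverage)
    if cap == 0:
        return [0.0 for _ in coverage]
    return [min(c, cap) / cap * track_height_px for c in coverage]


heights = bar_heights([10, 10, 10, 80], track_height_px=40)
```

Scaling against the median rather than the maximum keeps a single high-coverage outlier from flattening every other bar in the track.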

The disclosed genome browser further provides a breakpoints track 1414 in the displayed region. Structural variants including inter-chromosomal translocations, gene fusions, inversions and deletions are highlighted in the breakpoints track 1414. Structural variants are arbitrarily numbered in the display. Structural variant calls are indicated in a first color (e.g., orange) in the breakpoints track 1414 and structural variant candidates are indicated in a second color (e.g., grey) in the breakpoints track 1414. To display structural variant breakpoint pairs, a user can click on the structural variant displayed for the gene, as illustrated in FIG. 29. The structural variant is displayed in the details box 2902. By selecting “Zoom in on this breakpoint” 2904 in details box 2902, the other side of the breakpoint is brought up as an additional haplotype track, zoomed to the breakpoints, as illustrated in FIG. 30.

Advantageously, what is not shown in some embodiments of the display mode of the disclosed genome browser, illustrated in FIG. 14A, are base calls, error rates, specific reads, and alignments. Rather, the disclosed genome browser operates at a higher level in order to provide a more conceptual indication of what is occurring in the selected region and to provide this information in a way that is easy to understand. For this reason, some embodiments of the disclosed browser provide a display mode, such as the display mode illustrated in FIG. 14A, in which all of the sequence read data is not shown.

Referring to FIG. 14A, zoom affordance 1420 can be used to zoom into a subset of the region identified by search bar 1250 and zoom affordance 1422 can be used to zoom out of the region. In addition, a user can zoom in to a specific gene by clicking on the icon in region 1408 representing the specific gene.

In some embodiments, the search bar 1250 of the disclosed genome browser provides intelligent auto-complete features. For instance, when a user starts typing a gene name in the search bar 1250, the genome browser auto-completes on gene names. In some embodiments, the genome browser accomplishes this by comparing partial search queries that the user enters against genomic information stored in the nucleic acid sequencing dataset, such as the names of genes in the gene track. For instance, referring to FIG. 17, when a user enters the expression “atp” into the search bar, several possible matches 1702-1 through 1702-10 found within the nucleic acid sequence dataset 126 are displayed.
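
By way of non-limiting illustration only, comparing a partial search query against stored gene names might be sketched as a prefix search over a sorted list. The function name `autocomplete`, the case-insensitive matching, the limit of ten matches, and the short list of gene names used here are assumptions made for this sketch; in practice the names would be drawn from the gene track of the dataset.

```python
from bisect import bisect_left


def autocomplete(prefix, sorted_names, limit=10):
    """Return up to `limit` names that start with `prefix`.

    sorted_names must be upper-case and sorted; matching is
    case-insensitive on the user's input (an assumption).
    """
    key = prefix.upper()
    i = bisect_left(sorted_names, key)  # first name >= the prefix
    matches = []
    while (
        i < len(sorted_names)
        and len(matches) < limit
        and sorted_names[i].startswith(key)
    ):
        matches.append(sorted_names[i])
        i += 1
    return matches


# Illustrative gene names only.
names = sorted(["ATP1A1", "ATP2B4", "ATM", "BRCA1", "BRCA2"])
suggestions = autocomplete("atp", names)
```

Because the names are sorted, all candidates sharing a prefix are contiguous, so the search needs only one binary search followed by a short scan.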

As illustrated in FIGS. 12 through 30, the haplotype visualization tool 148 provides structural variation or phasing (e.g., haplotype) information for a nucleic acid sequence dataset.

In particular, referring to FIGS. 12 and 13, selection of the phasing/haplotypes toggle 1252 of the haplotype visualization tool 148 invokes the phase visualization module 152 as illustrated in FIGS. 14-17 and FIGS. 26-30. As illustrated in FIGS. 14-17 and FIGS. 26-30, visually separated tracks for haplotypes, as well as a virtual track for variants that could not be assigned to either haplotype, are provided. Variants can have a number of classifications, including unphased, homozygous, heterozygous-with-no-reference-reads, and/or heterozygous-with-reference-reads. The haplotype visualization tool 148 applies visually distinct stylings to these different configurations so that a user can quickly tell them apart. The haplotype visualization tool 148 can display the amount of barcode evidence used in assigning a variant to a particular phase block. In some embodiments, when the user “clicks” on a variant, every other visible variant is decorated with the count of barcodes that overlapped with the selected variant. Data that contradicts the called haplotype is highlighted. The haplotype visualization tool 148 also allows the user to view multiple regions at once. These regions are displayed as separate haplotype tracks in different areas of the screen. In this mode, “counts” are shared between each displayed region, allowing the user to view barcode overlaps between distant regions of the genome.

Again referring to FIGS. 12 and 13, selection of the structural variants toggle 1254 of the haplotype visualization tool 148 invokes the structural variants module 154 as illustrated in FIGS. 23-25 and 33-34. The matrix view provided by the structural variants module 154 encompasses a method for visualizing candidate structural variants. The visualization works by quantizing two (possibly overlapping) regions of the genome (test nucleic acid data) into chunks of between 100 and 10,000 base pairs per chunk. The number of shared barcodes between the reads in every pair of chunks is computed. The resulting matrix (with the chunks from one region as the rows and the chunks from the other region as the columns) can be displayed as a two dimensional image (heat map), as illustrated in FIGS. 23-25 and 33-34. In some embodiments, the color of a pixel corresponds to the number of distinct overlapping barcodes between a specific chunk (e.g., window) of each region. For example, consider two regions with consecutive chunks with the following barcodes:

(1) [AAA, ACA] [ACA, AGT] [GTG]

(2) [GTG, AAA] [CCC] [ACA, AAA]

There are nine pairs of chunks between region (1) and region (2) which can be placed in a matrix such as the one set forth below in Table 1.

TABLE 1 - matrix of pairs of chunks between region (1) and region (2).

Rows: chunks of region (1). Columns: chunks of region (2).

AAA, ACA vs GTG, AAA    AAA, ACA vs CCC    AAA, ACA vs ACA, AAA
ACA, AGT vs GTG, AAA    ACA, AGT vs CCC    ACA, AGT vs ACA, AAA
GTG vs GTG, AAA         GTG vs CCC         GTG vs ACA, AAA

Computing the overlap between the two sets of barcodes in each cell yields the values set forth in Table 2.

TABLE 2 - matrix values between region (1) and region (2).

Rows: chunks of region (1). Columns: chunks of region (2).

1    0    2
0    0    1
1    0    0

Table 2 can be displayed by the structural variants module 154 as a heat map which efficiently shows areas of low and high barcode correlation to the user. In some embodiments, the structural variants module 154 provides additional information, such as gene and exon boundaries overlaid with the matrix to allow easy alignment of the data to known places of interest. In some embodiments, the structural variants module 154 also allows a textual copy of the matrix to be downloaded for analysis with other computer programs. In some embodiments, the user may adjust the region of the genome that is visualized in the structural variants module 154 by scrolling or zooming in real time. In some embodiments, the user can adjust the resolution (chunk size/window size) to avoid aliasing or overload when looking at very small or very large areas of the genome.
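
By way of non-limiting illustration only, the computation that takes the chunk pairs of Table 1 to the values of Table 2 can be sketched as a set intersection per cell. The variable and function names, and the tab-separated form of the textual copy, are assumptions made for this sketch.

```python
# Barcodes observed in each consecutive chunk of the two example regions.
region1 = [{"AAA", "ACA"}, {"ACA", "AGT"}, {"GTG"}]
region2 = [{"GTG", "AAA"}, {"CCC"}, {"ACA", "AAA"}]


def overlap_matrix(rows, cols):
    """Count distinct shared barcodes for every (row chunk, column chunk)."""
    return [[len(r & c) for c in cols] for r in rows]


def matrix_as_text(matrix):
    """Render the matrix as tab-separated text (format is an assumption)."""
    return "\n".join("\t".join(str(v) for v in row) for row in matrix)


matrix = overlap_matrix(region1, region2)
# matrix == [[1, 0, 2], [0, 0, 1], [1, 0, 0]], matching Table 2.
```

Representing each chunk as a set means a barcode seen on many reads within a chunk is counted once, consistent with counting distinct overlapping barcodes.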

Some embodiments of the present disclosure provide a system 100 for viewing nucleic acid sequencing data (e.g., information obtained from nucleic acid sequencing datasets 126). The system 100 comprises one or more microprocessors 102 and a memory 112. The memory stores a nucleic acid sequence dataset 126 corresponding to at least one target nucleic acid in a sample. The memory further stores one or more programs (e.g., the haplotype visualization tool 148) that use the one or more microprocessors to obtain the nucleic acid sequencing dataset that comprises a plurality of sequencing reads from a sample. Then, a request is obtained from a user (e.g., through search bar 1250 of the haplotype visualization tool 148 illustrated in FIGS. 12 and 13) that specifies a genomic region represented by the nucleic acid sequencing dataset. Advantageously, this request can be in any of the syntaxes disclosed in the present disclosure. In some embodiments, the genomic region in the request is an entire chromosome. In some embodiments, the genomic region in the request is between 100 and 10,000 bases of the chromosome. In some embodiments, the genomic region in the request is between 10 and 1×10⁵ bases of the chromosome. In some embodiments, the genomic region in the request is between 10 and 1×10⁶ bases of the chromosome. In some embodiments, the genomic region in the request is between 10 and 1×10⁷ bases of the chromosome. In some embodiments, the request is for a gene in the genome of the sample. Responsive to obtaining the request, the request is parsed by obtaining a plurality of sequencing reads 1048 within the genomic region of the request from the nucleic acid sequencing dataset 126. 
Next, a scan window is run against the plurality of sequencing reads, thereby creating a plurality of windows, each respective window of the plurality of windows corresponding to a different region of the genomic region in the request and including an identity of each identifier (e.g., bar code) of each sequencing read in the different region of the genomic region in the nucleic acid sequencing dataset. Further, referring for example to FIG. 34, a two dimensional heat map 3312 that represents each possible window pair in the plurality of windows is displayed. Each respective window pair is displayed in the two dimensional heat map as a color selected from a color scheme based upon the number of identifiers in common in the respective window pair. It will be appreciated that window size will depend on the amount of the genome the user has requested to visualize. In some embodiments, when the user has requested to visualize a small region of the genome, smaller window sizes are used, and when the user has requested to visualize a larger region of the genome, larger window sizes are used.
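
By way of non-limiting illustration only, running a scan window against sequencing reads and counting identifiers in common for each window pair might be sketched as follows. The representation of a read as a (position, bar code) pair, the function names, the target number of windows, and the clamping of the window size to between 100 and 10,000 bases are assumptions made for this sketch.

```python
def choose_window_size(region_length, target_windows=200, lo=100, hi=10_000):
    """Pick a larger window for a larger requested region (assumed rule)."""
    return max(lo, min(hi, region_length // target_windows))


def window_barcodes(reads, region_start, region_end, window_size):
    """Group the bar codes of reads into consecutive fixed-size windows.

    reads: iterable of (position, barcode) pairs; reads outside
    [region_start, region_end) are ignored.
    """
    n_windows = -(-(region_end - region_start) // window_size)  # ceiling
    windows = [set() for _ in range(n_windows)]
    for position, barcode in reads:
        if region_start <= position < region_end:
            windows[(position - region_start) // window_size].add(barcode)
    return windows


def shared_counts(windows):
    """Number of identifiers in common for every possible window pair."""
    return [[len(a & b) for b in windows] for a in windows]


reads = [(5, "AAA"), (12, "ACA"), (25, "AAA"), (40, "GTG")]
windows = window_barcodes(reads, 0, 30, window_size=10)
heat = shared_counts(windows)
```

Each value in `heat` would then be mapped to a color from the color scheme to produce the two dimensional heat map.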

Referring to FIGS. 33 and 34, affordances 3302 and 3304 provide unique tools to clarify the displayed information. First, selection of the “hide expected overlap” affordance 3302 hides the bar code overlap signal that is expected when the genome is in a normal state, that is, the signal from bar codes associated with reads that are next to each other because the reference genome places them next to each other. Compare FIG. 33, with affordance 3302 not selected, with FIG. 34, with affordance 3302 selected. The view provided when affordance 3302 is selected is intended to emphasize those parts of the genome that are unexpectedly adjacent to each other. For instance, this view highlights a structural variation, such as a translocation from one chromosome to another, that would not be expected based on the reference genome but for which the bar codes nevertheless show an association. As such, affordance 3302 activates a filter that hides the normal signal and highlights the unexpected signals. In other words, the number of identifiers in common in respective window pairs is down-weighted to remove bar code signals arising from bar codes that are expected to be proximate to each other based on the reference genome sequence. In some embodiments, the filter associated with affordance 3302 considers the mean length of the fragments of the target nucleic acid that were sequenced (e.g., 50 kb). Bar codes that are within this threshold distance of the mean length of fragments do not contribute to the heat map when affordance 3302 is activated. In some embodiments, the filter is enabled by taking the entire set of bar codes in the nucleic acid sequencing dataset 126 that have been aligned against a reference genome. Then, only those regions along the reference genome that exhibit a gap that is greater than the mean fragment length are displayed. As such, the affordance 3302 filter acts to filter out the expected signal and highlights the differences between the bar code data and a reference genome.
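
By way of non-limiting illustration only, one simple form of a filter such as that associated with affordance 3302 might be sketched as follows. This sketch assumes that both axes of the heat map lie on the same chromosome, that each window is identified by its start coordinate, and that expected signal is removed outright rather than down-weighted; the function and parameter names are likewise assumptions.

```python
def hide_expected_overlap(matrix, row_starts, col_starts, mean_fragment_len=50_000):
    """Zero out window pairs closer together than the mean fragment length.

    matrix[i][j] is the shared bar code count for the window starting at
    row_starts[i] and the window starting at col_starts[j]. Pairs within
    the mean fragment length are expected to share bar codes in a normal
    genome, so they are suppressed.
    """
    return [
        [
            0 if abs(row_start - col_start) <= mean_fragment_len else value
            for col_start, value in zip(col_starts, row)
        ]
        for row_start, row in zip(row_starts, matrix)
    ]


counts = [[9, 4, 3], [4, 9, 2], [3, 2, 9]]
starts = [0, 40_000, 120_000]
filtered = hide_expected_overlap(counts, starts, starts)
```

In the example, the strong signal along the diagonal and between the two nearby windows is removed, leaving only the overlap between windows that are farther apart than a typical fragment.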

Referring to affordance 3304, each respective sequence read 1048 is mapped to a location on a reference genome with a confidence value that represents a probability that the respective sequence read was correctly mapped. The default is to only show data for sequence reads when this confidence value satisfies a stringent (high) threshold value so that misleading information is not displayed. However, a user sometimes still wants to see information for sequence reads that do not satisfy the stringent threshold confidence value. For instance, when too much data is filtered out based on the confidence threshold, unusual artifacts may appear in the heat map, such as regions of the heat map that appear to have no data. In reality, such regions may be just regions where the confidence in the localization of sequence reads 1048 is low (e.g., regions of the genome that exhibit extensive repeats). To determine whether there is actually no data (perhaps indicating an extensive structural variation), affordance 3304 allows the user to remove (or lower) the stringent threshold value and to permit the display of data from sequence reads 1048 that have been mapped to the reference genome with lower confidence values. In this way, the user can determine whether there is in fact a structural variation at sites that were missing data when the stringent threshold value was turned on or whether the genomic region simply represents a region where the confidence values for the sequence reads are low.

In a typical use case scenario associated with affordance 3304, sequence reads 1048 that do not satisfy a quality threshold are discarded and so are not used in downstream phasing algorithms and structural variation algorithms. The consequence of discarding such sequence reads is that it can introduce what looks like structure in the heat map plot illustrated in FIGS. 33 and 34. For instance, some regions of the map may lighten and some lines may be introduced, giving rise to the question of whether something in the actual sample is causing the signal to change. By selecting affordance 3304, the discarded reads are put back into the phasing and/or structural variation algorithms regardless of their quality score to see if this causes removal of the observed artifacts in the plot. In this way, artifacts of the data can be teased out: when a region of the plot is missing both before and after applying affordance 3304, there is greater confidence that the observed feature represents a feature (e.g., structural variation) in the at least one target nucleic acid in a respective sample rather than an artifact arising from discarding data from sequence reads 1048.

Referring to FIG. 34, the extent of barcode overlap between respective regions of the target nucleic acid is signified on a color scale 3406 by the number of barcodes (from sequence reads localized to the respective regions of the target nucleic acid) that overlap. Thus, in some embodiments, a color scheme is used, with each particular color in the color scheme uniquely representing a certain number of overlapping barcodes. For instance, if a first and second section of the target nucleic acid have in common a first number of barcodes, the color associated with the first number in the color scheme is used to represent the combination of the first and second section of the target nucleic acid. As illustrated in FIG. 34, the X axis 3308 and Y axis 3310 each represent the target nucleic acid and thus the coordinates of the first and second section of the target nucleic acid within the target nucleic acid define an X,Y position in the two dimensional grid, and the color associated with the value of the first number of barcodes is used to color this X,Y position in the two dimensional grid in accordance with the color scheme. In some embodiments, when a first and second section of the target nucleic acid have no barcodes in common, the color scheme dictates that the color used for the X,Y position that represents the combination of the first and second section of the target nucleic acid be white. In some embodiments, when a first and second section of the target nucleic acid have only a few barcodes in common (e.g., in various embodiments, only one barcode in common, only two barcodes in common, only three barcodes in common, only four barcodes in common or only five barcodes in common), the color scheme dictates that the color used for the X,Y position that represents the combination of the first and second section of the target nucleic acid be grey. 
That is, in suchembodiments, the first position in the color scheme is white, meaning noshared barcodes and the second position in the color scheme is grey,meaning a minimal set of barcodes in common. In some embodiments, thereare 10 different values in the color scheme corresponding to 10different values of shared sequence reads. In some embodiments, thereare 11 different values in the color scheme corresponding to 11different values of shared sequence reads. In some embodiments, thereare 12 different values in the color scheme corresponding to 12different values of shared sequence reads. In some embodiments, thereare 13 different values in the color scheme corresponding to 13different values of shared sequence reads. In some embodiments, thereare 14 different values in the color scheme corresponding to 14different values of shared sequence reads. In some embodiments, thereare 15 different values in the color scheme corresponding to 15different values of shared sequence reads. In some embodiments, thereare between five and one hundred different values in the color schemecorresponding to between five and one hundred different values of sharedsequence reads.
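By way of illustration only, the mapping from shared-barcode counts to positions in a color scheme might be sketched as follows. The function names, the ten-position scheme, the five-barcode "minimal" cutoff, and the saturation count of 50 are all assumptions chosen for this sketch; the specification does not prescribe how counts above the minimal set are distributed over the remaining colors.

```python
def overlap_matrix(barcodes_by_bin):
    """Count the barcodes shared by every pair of regions (bins).

    barcodes_by_bin is a list of sets; entry i holds the barcodes of the
    sequence reads localized to region i of the target nucleic acid.
    """
    n = len(barcodes_by_bin)
    return [
        [len(barcodes_by_bin[i] & barcodes_by_bin[j]) for j in range(n)]
        for i in range(n)
    ]


def color_index(shared, minimal_max=5, n_levels=10, saturation=50):
    """Map a shared-barcode count to a position in the color scheme.

    Position 0 (white) means no shared barcodes; position 1 (grey) means
    a minimal set (1..minimal_max). Larger counts are spread linearly
    over the remaining positions and clipped at `saturation`
    (hypothetical parameters, not taken from the specification).
    """
    if shared == 0:
        return 0
    if shared <= minimal_max:
        return 1
    remaining = n_levels - 2
    span = saturation - minimal_max
    step = min(shared, saturation) - minimal_max
    # Ceiling division maps 1..span onto 1..remaining.
    return 1 + -(-step * remaining // span)


# Hypothetical barcodes for three regions of the target nucleic acid.
bins = [{"A", "B", "C"}, {"B", "C"}, {"D"}]
matrix = overlap_matrix(bins)
grid = [[color_index(count) for count in row] for row in matrix]
```

Each entry of `grid` is the color position for one X,Y cell of the two dimensional grid; a renderer would then look up the actual color for that position.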

Referring to FIG. 34, affordance 3308 can be used to pan (translational movement of) the view initially selected by search field 1250 so that different regions of the reference genome can be viewed. Referring to FIG. 34, affordance 3310 can be used to zoom the view initially selected by search field 1250 so that different amounts of the reference genome can be viewed.

In some embodiments, the different views offered (e.g., haplotype/phase 152, structural variants 154, and reads 156) by the haplotype visualization tool 148 are all linked. For instance, a user may navigate from one view to another to see the same data using an alternate visualization without reentering information using affordances 1252, 1254, and 1256. For instance, the user may toggle between the matrix view of the structural variants module 154 and the haplotype view of the phase visualization module 152.

A “smart” search affordance 1250 is employed in the various views. Referring to FIG. 17, as a user types in the search affordance 1250, the program will attempt to auto-complete the partial query with real gene names or other forms of chromosomal locations in real time. In some embodiments, each time the user enters another character in the search affordance 1250, the partial query in the search affordance 1250 is queried against a lookup table in the subject nucleic acid sequencing dataset 126. In some embodiments, this lookup table is the gene track 320 and/or the exon track 322. Advantageously, in some embodiments, the haplotype visualization tool 148 maintains a history of past user queries. Thus, when a user starts to enter a new query, matches (or partial matches) against former queries are also displayed to the user for selection. This is particularly useful given the complex query syntax that is supported by the search bar 1250 in some embodiments. For example, as discussed above, a user may query for multiple regions at once by separating queries with a variety of punctuators. A user may also enter a genomic coordinate directly in a number of formats.
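As a non-limiting sketch of the auto-complete behavior described above, a partial query can be matched by prefix against both the history of past queries and a gene-name lookup table. The function name `suggest`, the choice to rank past queries first, the result limit, and the sample gene names and coordinate string are assumptions made for illustration only.

```python
def suggest(partial, gene_names, history, limit=5):
    """Return completions for a partial query.

    Past user queries are offered first, followed by gene names from the
    lookup table, both matched case-insensitively by prefix. Duplicates
    are suppressed so a gene that was queried before appears only once.
    """
    needle = partial.strip().lower()
    if not needle:
        return []
    seen = set()
    matches = []
    for candidate in list(history) + sorted(gene_names):
        key = candidate.lower()
        if key.startswith(needle) and key not in seen:
            seen.add(key)
            matches.append(candidate)
        if len(matches) == limit:
            break
    return matches


# Hypothetical lookup table and query history.
genes = ["BRCA1", "BRCA2", "BRAF", "TP53"]
past_queries = ["chr1:1000-2000", "BRCA2"]
completions = suggest("br", genes, past_queries)
```

Calling `suggest` again after each keystroke reproduces the per-character lookup described above; a production implementation would more likely use a sorted index or trie than a linear scan.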

In some embodiments, system 100 stores genomic data to be displayed in a custom file format (e.g., the format of nucleic acid sequencing dataset 126). The file is generated by a “preprocessor” which takes reference data, the VCF file, the BAM file, and the structural variant file as inputs and produces a single output nucleic acid sequencing dataset 126. The nucleic acid sequencing dataset 126 contains all of the information that is required to display a given dataset. The file is organized into several sections: a small synopsis section 308 that is roughly 25 MB and a much larger data section 340 (100 MB to 20 GB). These sections are further subdivided as described above. When the nucleic acid sequencing dataset 126 is loaded, system 100 loads just the index (synopsis) section into memory. System 100 uses that data to find appropriate ranges of the data section to load into memory on demand. Variant calls and read information are stored in the data section; the rest of the data loupe needs is small enough to store in the index section.
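The load-on-demand pattern described above can be sketched as follows, purely for illustration. The byte layout used here (a fixed 16-byte header holding two lengths, a JSON-encoded index, then the data section) is a hypothetical stand-in and is not the actual format of nucleic acid sequencing dataset 126; the names `build_dataset`, `Dataset`, and `load_range` are likewise assumptions.

```python
import io
import json
import struct

# Hypothetical header: synopsis length and data length, little-endian.
HEADER = struct.Struct("<QQ")


def build_dataset(chunks):
    """Serialize (chrom, start, end, payload) chunks into one byte string."""
    data = b""
    index = []
    for chrom, start, end, payload in chunks:
        index.append({"chrom": chrom, "start": start, "end": end,
                      "offset": len(data), "size": len(payload)})
        data += payload
    synopsis = json.dumps(index).encode("utf-8")
    return HEADER.pack(len(synopsis), len(data)) + synopsis + data


class Dataset:
    """Keeps only the header and index in memory; data stays on disk."""

    def __init__(self, fileobj):
        self.f = fileobj
        synopsis_len, _ = HEADER.unpack(self.f.read(HEADER.size))
        self.index = json.loads(self.f.read(synopsis_len))
        self.data_start = HEADER.size + synopsis_len
        self.bytes_loaded = 0

    def load_range(self, chrom, start, end):
        """Read only the chunks overlapping the half-open range [start, end)."""
        loaded = []
        for entry in self.index:
            if (entry["chrom"] == chrom
                    and entry["start"] < end and start < entry["end"]):
                self.f.seek(self.data_start + entry["offset"])
                loaded.append(self.f.read(entry["size"]))
                self.bytes_loaded += entry["size"]
        return loaded


# An in-memory file stands in for the on-disk dataset.
raw = build_dataset([
    ("chr1", 0, 1000, b"aaaa"),
    ("chr1", 1000, 2000, b"bbbbbb"),
    ("chr2", 0, 1000, b"cc"),
])
dataset = Dataset(io.BytesIO(raw))
result = dataset.load_range("chr1", 1500, 1600)
```

Only the one chunk overlapping the requested range is read; the other chunks are never loaded into memory.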

The data section is organized into chunks of about 250 KB each in some embodiments. When system 100 requires information stored in the data section, it consults the relevant index in the synopsis section (e.g., gene track, exon track, etc.) to find the chunk that should have the data and loads the entire chunk into memory. In some embodiments, the chunks for variant data are JSON-encoded structures containing the variant data as well as the supporting barcode information. In some embodiments, the chunks for read data have an array of small (8-byte) data structures in which each structure contains the position, length, and barcode of a single read. In some embodiments, both variant and read data are sorted by genomic position so that, in general, system 100 will make only a small number of on-disk reads to acquire all of the data it needs to display a given subset of the data. In some embodiments, the rest of the data that system 100 needs for visualization (such as the location of genes, structural variant breakpoints, etc.) is stored in the index (synopsis) section of the nucleic acid sequencing dataset 126 file as an “itree”. An itree is an implementation of an interval tree. It is a reusable data structure (usually encoded in JSON) for annotating ranges of the genome. Thus exons, genes, phase blocks, and structural variant breakpoints are all encoded with the same mechanism even though they are displayed differently.
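One way an 8-byte read record could be laid out is sketched below. The field widths are assumptions, since the specification states only that each structure holds the position, length, and barcode of a single read: a 4-byte position, a 2-byte length, and a 2-byte shortened barcode identifier are used here. The function names are hypothetical.

```python
import struct

# Hypothetical layout: uint32 position, uint16 length, uint16 barcode id.
# The "<" prefix disables padding, so each record is exactly 8 bytes.
READ_RECORD = struct.Struct("<IHH")


def encode_reads(reads):
    """Pack (position, length, barcode_id) tuples, sorted by position."""
    return b"".join(READ_RECORD.pack(*read) for read in sorted(reads))


def decode_reads(chunk):
    """Unpack a read-data chunk back into a list of tuples."""
    return list(READ_RECORD.iter_unpack(chunk))


chunk = encode_reads([(500, 150, 7), (100, 98, 3)])
records = decode_reads(chunk)
```

Because the records are fixed-width and sorted by position, a reader can also binary-search a chunk by byte offset instead of decoding it in full.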

Plural instances may be provided for components, operations or structures described herein as a single instance. Finally, boundaries between various components, operations, and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the implementation(s). In general, structures and functionality presented as separate components in the example configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the implementation(s).

It will also be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first object could be termed a second object, and, similarly, a second object could be termed a first object, without changing the meaning of the description, so long as all occurrences of the “first object” are renamed consistently and all occurrences of the “second object” are renamed consistently. The first object and the second object are both objects, but they are not the same object.

The terminology used herein is for the purpose of describing particular implementations only and is not intended to be limiting of the claims. As used in the description of the implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

As used herein, the term “if” may be construed to mean “when” or “upon” or “in response to determining” or “in accordance with a determination” or “in response to detecting,” that a stated condition precedent is true, depending on the context. Similarly, the phrase “if it is determined (that a stated condition precedent is true)” or “if (a stated condition precedent is true)” or “when (a stated condition precedent is true)” may be construed to mean “upon determining” or “in response to determining” or “in accordance with a determination” or “upon detecting” or “in response to detecting” that the stated condition precedent is true, depending on the context.

The foregoing description included example systems, methods, techniques, instruction sequences, and computing machine program products that embody illustrative implementations. For purposes of explanation, numerous specific details were set forth in order to provide an understanding of various implementations of the inventive subject matter. It will be evident, however, to those skilled in the art that implementations of the inventive subject matter may be practiced without these specific details. In general, well-known instruction instances, protocols, structures and techniques have not been shown in detail.

The foregoing description, for purpose of explanation, has been described with reference to specific implementations. However, the illustrative discussions above are not intended to be exhaustive or to limit the implementations to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The implementations were chosen and described in order to best explain the principles and their practical applications, to thereby enable others skilled in the art to best utilize the implementations and various implementations with various modifications as are suited to the particular use contemplated.

1-23. (canceled)
24. A system for providing structural variation and phasing information, the system comprising one or more microprocessors, a persistent memory and a non-persistent memory that collectively store one or more programs that use the one or more microprocessors to perform a method of: obtaining a request for structural variation and phasing information in a nucleic acid sequencing dataset, wherein the nucleic acid sequencing dataset represents at least one target nucleic acid in a sample associated with a genome of at least one species, the nucleic acid sequencing dataset comprises (i) a header, (ii) a synopsis, and (iii) a data section, the data section comprises a plurality of sequencing reads, each respective sequencing read in the plurality of sequencing reads comprises a nucleic acid sequence comprising a first portion that corresponds to a subset of a target nucleic acid in the at least one target nucleic acid and a second portion that encodes a respective identifier for the respective sequencing read in a plurality of identifiers, each respective identifier is independent of each sequence of the at least one target nucleic acid, the plurality of sequencing reads collectively include the plurality of identifiers, and the nucleic acid sequence dataset is 1 gigabyte or greater in size; responsive to obtaining the request, automatically parsing the request by: (i) loading the header and the synopsis of the nucleic acid sequencing dataset into the non-persistent memory if not already loaded into the non-persistent memory while retaining the data section in persistent memory, (ii) comparing the request to the synopsis of the nucleic acid sequencing dataset thereby identifying one or more portions of the data section of the nucleic acid sequencing dataset, (iii) loading the one or more identified portions of the data section into non-persistent memory, wherein the loading loads less than the entirety of the data section, and (iv) formatting structural variation and phasing information for display using the nucleic acid sequencing dataset.
25. The system of claim 24, wherein the header delineates a plurality of components in the nucleic acid sequencing dataset.
26. The system of claim 25, wherein the plurality of components comprises two or more components selected from the group consisting of a summary, an index to variant call data, a phase block track, a refseq index track, a gene track, an exon track, an index to read data, a structural variant dataset track, an index to a target dataset, and an index to a fragment dataset.
27. The system of claim 26, wherein the plurality of components comprises the summary and wherein the summary comprises two or more items in the group consisting of: a percentage of known SNPs phased in the respective nucleic acid sequencing dataset, a longest phase block in the respective nucleic acid sequencing dataset, a number of unique barcodes used in the respective nucleic acid sequencing dataset, an average fragment length in the respective nucleic acid sequencing dataset, a mean of the average fragment length in the respective nucleic acid sequencing dataset, a percentage of fragments greater than a lower threshold in the respective nucleic acid sequencing dataset, a fragment length histogram in the respective nucleic acid sequencing dataset, an N50 phase block size in the respective nucleic acid sequencing dataset, a phase block histogram in the respective nucleic acid sequencing dataset, a number of sequence reads represented by respective the nucleic acid sequencing dataset, a median insert size in the respective nucleic acid sequencing dataset, a median depth in the respective nucleic acid sequencing dataset, a percent of the target genome with zero coverage in the respective nucleic acid sequencing dataset, a mapped reads percentage for the respective nucleic acid sequencing dataset, a PCR duplication percentage for the respective nucleic acid sequencing dataset, a coverage histogram for the in the respective nucleic acid sequencing dataset, an identity of a test nucleic acid that forms the basis for the respective nucleic acid sequencing dataset, a genome source for the respective nucleic acid sequencing dataset, a sex of an organism that originated the at least one test nucleic acid of the respective nucleic acid sequencing dataset, a sex of the organism that originate the respective sample of the in the respective nucleic acid sequencing dataset, a dataset file format version of the in the respective nucleic acid sequencing dataset, and a pointer to a plurality of structural variant calls made for the respective nucleic acid sequencing dataset.
28. The system of claim 26, wherein the plurality of components comprises the index to variant call data that provides a correspondence between respective ranges of a genome of the at least one species to offsets in the data section where variant call data for the respective ranges is found.
29. The system of claim 26, wherein the plurality of components comprises the phase block track and wherein the phase block track comprises (i) a dictionary and (ii) a track data section comprising phase information for one or more chromosomes in a genome of the at least one species.
30. The system of claim 26, wherein the plurality of components comprises the refseq index track, wherein the refseq index track comprises an index of a plurality of molecular variation identifiers that are called in the sample.
31. The system of claim 26, wherein the plurality of components comprises the gene track and wherein the gene track comprises (i) a gene track dictionary and (ii) a gene track data section.
32. The system of claim 26, wherein the plurality of components comprises the index to read data wherein the index to read data comprises a lookup table between a respective identifier in the plurality of identifiers and a shortened version of the respective identifier.
33. The system of claim 26, wherein the plurality of components comprises the structural variant dataset track, and the structural variant dataset track comprises (i) a dictionary and (ii) a track data section comprising structural variant call information identified in the plurality of sequencing reads.
34. The system of claim 33, wherein the dictionary comprises a plurality of names, and for each respective name in the plurality of names, an offset into the track data where records for the corresponding name are found.
35. The system of claim 34, wherein the track data section comprises a plurality of structural variant records, and each structural variant record in the plurality of structural variant records represents a structural variant call made in the at least one target nucleic acid in the sample.
36. The system of claim 35, wherein each respective structural variant record in the plurality of structural variant records is represented by a node in a plurality of nodes in a respective interval tree in a plurality of interval trees, and each interval tree in the plurality of interval trees represents a chromosome in a plurality of chromosomes for the species.
37. The system of claim 36, wherein the plurality of components comprises the index to the target dataset, the target dataset comprises the regions of the at least one target nucleic acid in the sample that were selected for sequencing in the respective nucleic acid sequencing dataset, the target dataset is indexed by a target dataset index stored in the synopsis, and the target dataset is stored in the data section.
38. The system of claim 26, wherein the plurality of components comprises the index to the fragment dataset, the fragment dataset comprises a length, chromosomal position, identifier, and phase of each fragment of the at least one target nucleic acid in the sample, the fragment dataset is indexed by a fragment dataset index stored in the synopsis, and the fragment dataset is stored in the data section.
39. The system of claim 24, wherein the request is for phasing information in a region of a genome and the formatted phasing information includes a graphic representation comprising: a first haplotype track corresponding to a first parental haplotype of a first species in the at least one species in the region of the genome for the dataset, a second haplotype track, corresponding to a second parental haplotype of the first species in the region of the genome for the nucleic acid sequencing dataset, an indeterminate track corresponding to regions of the at least one nucleic acid sample that have not been assigned a parental haplotype in the region of the genome for the nucleic acid sequencing dataset.
40. The system of claim 39, wherein the graphic representation further comprises a graphic representation of each gene that is in the region of the genome.
41. The system of claim 39, wherein the graphic representation further comprises a coverage track for the region of the genome, wherein the coverage track comprises a plurality of vertical bars, and wherein each respective vertical bar in the plurality of vertical bars indicates an average coverage-per-base in the first dataset for a corresponding portion of the genome under the bar.
42. The system of claim 24, wherein the request is converted, without human intervention, to genomic coordinates by comparison of the request against one or more lookup tables that match alphanumeric entries of genes to genomic coordinates.
43. The system of claim 24, wherein the respective sample is associated with a genome of a plurality of species and includes at least a portion of the genome of a first species and a portion of the genome of the second species.
44. The system of claim 24, wherein the request for structural variation and phasing information in a nucleic acid sequencing dataset is in the form X₁:N₁-N₂, X₁ is an identity of a selected chromosome or a selected first contig sequence, N₁ is a selected start position within the first chromosome or the selected first contig sequence, and N₂ is a selected end position within the first chromosome or the selected first contig sequence.
45. The system of claim 24, wherein the request for structural variation and phasing information in a nucleic acid sequencing dataset is in the form X₁:N₁, X₁ is an identity of a selected first chromosome or a selected first contig sequence, and N₁ is a number of nucleotides, beginning at the origin of the first chromosome or the selected first contig sequence.
46. The system of claim 24, wherein the request for structural variation and phasing information in a nucleic acid sequencing dataset is in the form Y₁, Y₂, . . . , Y_(N), wherein each Y_(i) in Y₁, Y₂, . . . , Y_(N) is either an alphanumeric identification of a selected gene, a selection of a chromosomal region, or selection of a region of a contig sequence.
47. The system of claim 46, wherein a first Y_(i) in Y₁, Y₂, . . . , Y_(N) is an identity of a first chromosome or a first contig sequence having a syntax X₁:N₁-N₂, X₁ is an identity of the first chromosome or the first contig sequence, N₁ is a selected start position within the first chromosome or the first contig sequence, N₂ is a selected end position within the first chromosome or the first contig sequence, and a second Y_(i) in Y₁, Y₂, . . . , Y_(N) is an alphanumeric identification of a selected gene.
48. A method for providing structural variation and phasing information, the method comprising: obtaining a request for structural variation and phasing information in a nucleic acid sequencing dataset, wherein the nucleic acid sequencing dataset represents at least one target nucleic acid in a sample associated with a genome of at least one species, the nucleic acid sequencing dataset comprises (i) a header, (ii) a synopsis, and (iii) a data section, the data section comprises a plurality of sequencing reads, each respective sequencing read in the plurality of sequencing reads comprises a nucleic acid sequence comprising a first portion that corresponds to a subset of a target nucleic acid in the at least one target nucleic acid and a second portion that encodes a respective identifier for the respective sequencing read in a plurality of identifiers, each respective identifier is independent of each sequence of the at least one target nucleic acid, the plurality of sequencing reads collectively include the plurality of identifiers, and the nucleic acid sequence dataset is 1 gigabyte or greater in size; responsive to obtaining the request, automatically parsing the request at a computer system comprising a processor, non-persistent memory, and persistent memory, by: (i) loading the header and the synopsis of the nucleic acid sequencing dataset into the non-persistent memory if not already loaded into the non-persistent memory while retaining the data section in persistent memory, (ii) comparing the request to the synopsis of the nucleic acid sequencing dataset thereby identifying one or more portions of the data section of the nucleic acid sequencing dataset, (iii) loading the one or more identified portions of the data section into non-persistent memory, wherein the loading loads less than the entirety of the data section, and (iv) formatting structural variation and phasing information for display using the nucleic acid sequencing dataset.
49. A non-transitory computer readable storage medium for providing structural variation and phasing information, wherein the non-transitory computer readable storage medium stores instructions, which when executed by a computer system comprising non-persistent memory and persistent memory, cause the computer system to perform a method comprising: obtaining a request for structural variation and phasing information in a nucleic acid sequencing dataset, wherein the nucleic acid sequencing dataset represents at least one target nucleic acid in a sample associated with a genome of at least one species, the nucleic acid sequencing dataset comprises (i) a header, (ii) a synopsis, and (iii) a data section, the data section comprises a plurality of sequencing reads, each respective sequencing read in the plurality of sequencing reads comprises a nucleic acid sequence comprising a first portion that corresponds to a subset of a target nucleic acid in the at least one target nucleic acid and a second portion that encodes a respective identifier for the respective sequencing read in a plurality of identifiers, each respective identifier is independent of each sequence of the at least one target nucleic acid, the plurality of sequencing reads collectively include the plurality of identifiers, and the nucleic acid sequence dataset is 1 gigabyte or greater in size; responsive to obtaining the request, automatically parsing the request by: (i) loading the header and the synopsis of the nucleic acid sequencing dataset into the non-persistent memory if not already loaded into the non-persistent memory while retaining the data section in persistent memory, (ii) comparing the request to the synopsis of the nucleic acid sequencing dataset thereby identifying one or more portions of the data section of the nucleic acid sequencing dataset, (iii) loading the one or more identified portions of the data section into non-persistent memory, wherein the loading loads less than the entirety of the data section, and (iv) formatting structural variation and phasing information for display using the nucleic acid sequencing dataset.